The present disclosure relates to a method of distinguishing active TB in the presence of a complicating factor, for example, latent TB and/or co-morbidities, such as those that present similar symptoms to TB. The disclosure also relates to a gene signature employed in the said method and to a bespoke gene chip for use in the method. The disclosure further relates to use of known gene chips in the methods of the disclosure and kits comprising the elements required for performing the method. The disclosure also relates to use of the method to provide a composite expression score which can be used in the diagnosis of TB, particularly in a low resource setting.
An estimated 8.8 million new cases and 1.45 million deaths are caused each year by tuberculosis (TB, short for tubercle bacillus) (World Health Organisation statistics 2011). TB is an infectious disease caused by various species of mycobacteria, typically Mycobacterium tuberculosis. Tuberculosis usually attacks the lungs but can also affect other parts of the body. It is spread through the air when people who have an active TB infection cough, sneeze, or otherwise transmit their saliva. Most infections in humans result in an asymptomatic, latent infection, and about one in ten latent infections eventually progress to active disease, which, if left untreated, kills more than 50% of those infected. Immunosuppression and malnutrition are among the risk factors for developing active TB.
The classic symptoms are a chronic cough with blood-tinged sputum, fever, night sweats, and weight loss (the latter giving rise to the formerly prevalent colloquial term “consumption”). Infection of organs other than the lungs causes a wide range of symptoms. Treatment is difficult and requires long courses of multiple antibiotics. Antibiotic resistance is a growing problem with numbers of multi-drug-resistant tuberculosis cases on the rise. This is, in part, due to the length of treatment needed. Those infected with latent TB are typically asymptomatic and therefore either forget or decide not to take antibiotics. Those infected with active TB often cease treatment when the symptoms clear even though the infection remains.
Correct diagnosis is of utmost importance in the treatment of TB. The treatment regimens for active TB and latent TB are different and so it is important to diagnose the two conditions correctly in order to provide appropriate therapy.
Diagnosis of TB is particularly complicated as it cannot solely be based on symptoms. This is for two reasons: those infected with latent TB exhibit no symptoms and active TB may present similar symptoms to other infections or illnesses. Matters may be further complicated by the fact that TB may not be the only infection or illness that the patient has. Co-morbidities and co-infections often mask the symptoms of active TB and thus the latter goes undiagnosed and untreated. If active TB goes untreated the patient has a high probability of death due to the disease. Not only does TB present similar symptoms to other infectious or non-infectious conditions but it also presents similar radiological features. Thus identifying the presence of TB definitively can be difficult.
Diagnosis is therefore multi-faceted, relying on clinical and radiological features (commonly chest X-rays), sputum microscopy (with or without culture), tuberculin skin test (TST), blood tests, as well as microscopic examination and microbiological culture of bodily fluids. In many places, such as Africa, the resources needed to make a full diagnosis are often unavailable, which is a major impediment to tuberculosis treatment and control. Culture facilities are largely unavailable for TB diagnosis in most African hospitals.
All of the known methods of diagnosis have drawbacks, particularly in HIV co-infected persons in whom radiological features are often atypical:
Consequently, a high proportion of active TB cases in sub-Saharan Africa remain undiagnosed, and post-mortem studies show TB to be a frequent, undiagnosed cause of death. There is an urgent need for improved diagnostic tests for TB, particularly in patients co-infected with HIV.
RNA expression analysis by microarray has emerged as a powerful tool for understanding disease biology. Many diseases, including cancer and infectious diseases are associated with specific transcriptional profiles in blood or tissue.
In an influential study, Berry et al (2010) found a 393 transcript signature derived in a UK cohort that was able to distinguish TB from LTBI, and an 86 transcript signature able to distinguish TB from other inflammatory diseases. However, these signatures were derived from UK populations of HIV-uninfected individuals. Therefore these signatures are of limited application in Africa, where HIV infection and LTBI are endemic.
Many previous TB diagnostic biomarker studies have focused on distinguishing patients with TB from healthy uninfected or LTBI (Maertzdorf et al 2011a 2011b, Jacobsen et al 2007) or have used other disease controls which are not representative of the real world clinical diseases from which TB needs to be distinguished in Africa (Maertzdorf et al 2012, Berry et al 2010). Furthermore, previous studies have excluded HIV co-infected patients who are in fact the group in which new diagnostics are most needed.
Thus there is a need to identify biomarkers that discriminate TB from other diseases prevalent in African populations, where the burden of the HIV/TB pandemic is greatest.
The present disclosure provides a method for detecting active TB in a subject derived sample in the presence of a complicating factor, comprising the step of detecting the modulation of at least 60% of the genes in a signature selected from the group consisting of:
Advantageously use of the appropriate signature in a method according to the present disclosure allows the robust and accurate identification of the presence of active TB or the differentiation of active TB from latent TB in the most relevant clinical setting, for example Africa. The detection is not prevented by co-morbidity in the patient, such as HIV or malaria. This is a huge step forward on the road to treating TB because it allows accurate diagnosis which, in turn, allows patients to be appropriately treated. Furthermore, the components for use in the method to detect active TB can be provided in a simple format for use in low resource and/or rural settings.
In another aspect of the disclosure there is provided a gene chip comprising one or more of the gene signatures selected from the group consisting of:
In a further aspect the present disclosure includes use of a known or commercially available gene chip in the method of the present disclosure.
Advantageously the different expression patterns represented by the gene signatures employed in the method of the present disclosure correlate across geographic location and HIV infected status (i.e. positive or negative). That is to say, the method is applicable to different geographic locations regardless of the presence or absence of HIV.
In a further aspect the present disclosure provides the treatment of active TB or latent TB after diagnosis employing the method herein.
The rows are transcripts (red=up-regulated, green=down-regulated). Columns are cases regardless of HIV status (purple are TB cases, green are LTBI, light blue are OD).
HIV+=HIV-infected, HIV−=HIV-uninfected
Definite TB case: a participant with a clinical condition consistent with tuberculosis and microbiological confirmation with evidence from at least two specimens confirming the presence of Acid Fast Bacilli (AFB) with at least one specimen confirmed on culture as MTB complex.
Latent TB infected case: a participant who is clinically assessed as healthy and not suffering from a clinical syndrome in which tuberculosis is likely. The individuals will have a tuberculin skin test (TST) size of 10 mm or more if HIV negative, or 5 mm or more if HIV positive and a positive Interferon Gamma Release Assay (IGRA) and negative sputum culture. Sputums were only collected in Malawi if cough was productive, when at least two samples would be collected. LTBI criteria were later relaxed to allow a positive TST and/or a positive IGRA to facilitate recruitment in Malawi. This change was made prior to any RNA expression measurements.
Other disease case: A participant with a disease syndrome that on presentation includes tuberculosis in the differential diagnosis, but following clinical management will have tuberculosis excluded and a firm alternative diagnosis established.
For coloured versions of the figures refer to Kaforou et al (PLOS medicine—submitted 2013)
In one embodiment there is detected the modulation of at least 60% of the genes in a signature such as 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% providing the signature retains the ability to detect/discriminate the relevant clinical status without significant loss of specificity and/or sensitivity. The details of the gene signatures are given below.
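The 60% criterion can be illustrated with a minimal Python sketch. It is not the method of the disclosure: the function names, the five-gene excerpt and the measured values are hypothetical, and it assumes that a gene counts as modulated when its log2 fold change exceeds 0.5 in the direction listed for that gene.

```python
def min_genes_required(signature_size, percent=60):
    """Smallest whole number of genes that is at least `percent` % of the signature."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return (signature_size * percent + 99) // 100


def fraction_modulated(expected_direction, log2_fold_change, threshold=0.5):
    """Fraction of signature genes modulated in the expected direction.

    expected_direction: gene -> +1 (up-regulated) or -1 (down-regulated)
    log2_fold_change:   gene -> measured log2 fold change versus the comparator
    threshold:          assumed cut-off on the log2 fold change
    """
    hits = 0
    for gene, direction in expected_direction.items():
        change = log2_fold_change.get(gene)
        # A gene with no measurement is counted as not modulated.
        if change is not None and direction * change > threshold:
            hits += 1
    return hits / len(expected_direction)


# Hypothetical five-gene excerpt; directions follow the lists given for the 27 gene signature.
expected = {"CD79A": 1, "CXCR5": 1, "GNG7": 1, "DUSP3": -1, "GAS6": -1}
# Made-up measurements, for illustration only.
measured = {"CD79A": 0.9, "CXCR5": 0.7, "GNG7": 0.1, "DUSP3": -1.2, "GAS6": -0.8}

share = fraction_modulated(expected, measured)
print(f"{share:.0%} of the excerpt is modulated; criterion met: {share >= 0.60}")
for size in (27, 44, 53):
    print(f"{size} gene signature: at least {min_genes_required(size)} genes")
```

Under these assumptions, 60% corresponds to at least 17 of the 27 genes, 27 of the 44 genes and 32 of the 53 genes.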
In one embodiment the exact gene list in one or more of Tables 2, 3 and 4 is employed.
In one embodiment of the present disclosure the gene signature is the minimum set of genes required to optimally detect the infection or discriminate the disease.
Optimally is intended to mean the smallest set of genes needed to detect active TB without significant loss of specificity and/or sensitivity of the signature's ability to detect or discriminate.
Detect or detecting as employed herein is intended to refer to the process of identifying an active TB infection in a sample, in particular through detecting modulation of the relevant genes in the signature.
Discriminate refers to the ability of the signature to differentiate between different disease statuses, for example latent and active TB. Detect and discriminate are interchangeable in the context of the gene signature.
In one embodiment the method is able to detect an active TB infection in a sample.
Subject as employed herein is a human suspected of TB infection from whom a sample is derived. The term patient may be used interchangeably although in one embodiment a patient has a morbidity.
Modulation of gene expression as employed herein means up-regulation or down-regulation of a gene or genes.
Up-regulated as employed herein is intended to refer to a gene transcript which is expressed at higher levels in a diseased or infected patient sample relative to, for example, a control sample free from a relevant disease or infection, or in a sample with latent disease or infection or a different stage of the disease or infection, as appropriate.
Down-regulated as employed herein is intended to refer to a gene transcript which is expressed at lower levels in a diseased or infected patient sample relative to, for example, a control sample free from a relevant disease or infection or in a sample with latent disease or infection or a different stage of the disease or infection.
The modulation is measured by measuring levels of gene expression by an appropriate technique.
Gene expression as employed herein is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but in non-protein coding genes such as ribosomal RNA (rRNA), transfer RNA (tRNA) or small nuclear RNA (snRNA) genes, the product is a functional RNA. That is to say, RNA with a function.
A complicating factor as employed herein refers to at least one clinical status or at least one medical condition that would generally render it more difficult to identify the presence of active TB in the sample, for example a latent TB infection or a co-morbidity.
Co-morbidity as employed herein refers to the presence of one or more disorders or diseases in addition to TB, for example malignancy such as cancer or co-infection. Co-morbidity may or may not be endemic in the general population.
In one embodiment the co-morbidity is a co-infection.
Co-infection as employed herein refers to bacterial infection, viral infection such as HIV, fungal infection and/or parasitic infection such as malaria. HIV infection as employed herein also extends to include AIDS.
In one embodiment other disease (OD) is a co-morbidity.
In one embodiment the 44 gene signature is able to detect active TB in the presence of a co-morbidity such as a co-infection. This is despite the increased inflammatory response of the patient to said other infection.
In one embodiment co-morbidity is selected from malignancy, HIV, malaria, pneumonia, Lower Respiratory Tract Infection, Pneumocystis Jirovecii Pneumonia, pelvic inflammatory disease, Urinary Tract Infection, bacterial or viral meningitis, hepatobiliary disease, cryptococcal meningitis, non-TB pleural effusion, empyema, gastroenteritis, peritonitis, gastric ulcer and gastritis.
In one embodiment malignancy is a neoplasia, such as bronchial carcinoma, lymphoma, cervical carcinoma, ovarian carcinoma, mesothelioma, gastric carcinoma, metastatic carcinoma, benign salivary tumour, dermatological tumour or Kaposi's sarcoma.
In one embodiment there is provided a method for detecting active TB in a subject derived sample in the presence of a complicating factor, comprising the step of detecting the modulation of at least 60% of the genes in a signature selected from the group consisting of:
The 27 gene signature shown in Table 3 is useful in discriminating active TB infection from latent TB infection.
Active TB as employed herein refers to a person who is infected with TB which is not latent.
In one embodiment active TB is where the disease is progressing as opposed to where the disease is latent.
In one embodiment a person with active TB is capable of spreading the infection to others.
In one embodiment a person with active TB has one or more of the following: a skin test or blood test result indicating TB infection, an abnormal chest x-ray, a positive sputum smear or culture, active TB bacteria in his/her body, feels sick and may have symptoms such as coughing, fever, and weight loss.
In one embodiment a person with active TB has one or more of the following symptoms: coughing, bloody sputum, fever and/or weight loss.
In one embodiment the active TB infection is pulmonary and/or extra-pulmonary.
Pulmonary as employed herein refers to an infection in the lungs.
Extra-pulmonary as employed herein refers to infection outside the lungs, for example, infection in the pleura, infection in the lymphatic system, infection in the central nervous system, infection in the genito-urinary tract, infection in the bones, infection in the brain and/or infection in the kidneys.
Symptoms of pulmonary TB include: a persistent cough that brings up thick phlegm, which may be bloody; breathlessness, which is usually mild to begin with and gradually gets worse; weight loss; lack of appetite; a high temperature of 38° C. (100.4° F.) or above; extreme tiredness; and a sense of feeling unwell.
Symptoms of lymph node TB include: persistent, painless swelling of the lymph nodes, which usually affects nodes in the neck, but swelling can occur in nodes throughout your body; over time, the swollen nodes can begin to release a discharge of fluid through the skin.
Symptoms of skeletal TB include: bone pain; curving of the affected bone or joint; loss of movement or feeling in the affected bone or joint and weakened bone that may fracture easily.
Symptoms of gastrointestinal TB include: abdominal pain; diarrhoea and anal bleeding.
Symptoms of genitourinary TB include: a burning sensation when urinating; blood in the urine; a frequent urge to pass urine during the night and groin pain.
Symptoms of central nervous system TB include: headaches; being sick; stiff neck; changes in your mental state, such as confusion; blurred vision and fits.
Latent TB as employed herein refers to a subject who is infected with TB but is asymptomatic. A sputum test will generally be negative and the infection cannot be spread to others.
In one embodiment a person with latent TB infection has one or more of the following: a skin test or blood test result indicating TB infection, a normal chest x-ray and a negative sputum test, TB bacteria in his/her body that are alive but inactive, does not feel sick, cannot spread TB bacteria to others.
In one embodiment a person with latent TB needs treatment to prevent TB disease becoming active.
In one embodiment the method of the present disclosure is able to differentiate TB from different conditions/diseases or infections which have similar clinical symptoms.
Similar symptoms as employed herein includes one or more symptoms from pulmonary TB, lymph node TB, skeletal TB, gastrointestinal TB, genitourinary TB and/or central nervous system TB.
In one embodiment the method according to the present disclosure is performed on a subject with acute infection.
In a further embodiment the sample is a subject sample from a febrile subject, that is to say with a temperature above the normal body temperature of 37.5° C.
Thus in one embodiment DNA or RNA from the subject sample is analysed.
In one embodiment the sample is solid or fluid, for example blood or serum or a processed form of any one of the same.
A fluid sample as employed herein refers to liquids originating from inside the bodies of living people. They include fluids that are excreted or secreted from the body as well as body water that normally is not. Examples include amniotic fluid, aqueous humour and vitreous humour, bile, blood serum, breast milk, cerebrospinal fluid, cerumen (earwax), chyle, endolymph and perilymph, gastric juice, mucus (including nasal drainage and phlegm), sputum, peritoneal fluid, pleural fluid, saliva, sebum (skin oil), semen, sweat, tears, vaginal secretion, vomit and urine. Blood and serum are of particular interest.
Blood as employed herein refers to whole blood, that is serum, blood cells and clotting factors, typically peripheral whole blood.
Serum as employed herein refers to the component of whole blood that is not blood cells or clotting factors. It is plasma with fibrinogens removed.
In one embodiment the subject derived sample is a blood sample.
In one or more embodiments the analysis is ex vivo.
Ex vivo as employed herein means that which takes place outside the body.
In one embodiment one or more, for example 1 to 21, such as 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20, genes are replaced by a gene with an equivalent function provided the signature retains the ability to detect/discriminate the relevant clinical status without significant loss in specificity and/or sensitivity.
In one embodiment the genes employed have identity with genes listed in the relevant tables.
In one embodiment the 27 gene signature comprises or consists of at least up-regulated genes CD79A, CD79B, CXCR5, GNG7, CCR6, ZNF296.
In one embodiment the 27 gene signature comprises or consists of at least down-regulated genes C5, FAM20A, DUSP3, GAS6, S100A8, FCGR1B, LHFPL2, FCGR1A, MPO, FCGR1C, GAS6, C1QB, ANKRD22, FCGR1B, GBP6, C4ORF18, C1QC, FLVCR2, VAMP5, SMARCD3, and LOC728744.
In one embodiment the 27 gene signature comprises or consists of at least up-regulated genes CD79A, CD79B, CXCR5, GNG7, CCR6, ZNF296 and optionally down-regulated genes C5, FAM20A, DUSP3, GAS6, S100A8, FCGR1B, LHFPL2, FCGR1A, MPO, FCGR1C, GAS6, C1QB, ANKRD22, FCGR1B, GBP6, C4ORF18, C1QC, FLVCR2, VAMP5, SMARCD3, and LOC728744.
In one embodiment the 44 gene signature comprises or consists of at least up-regulated genes ARG1, IMPA2, RP5-1022P6.2, ORM1, EBF1, PDK4, MAK, VPREB3, HS.131087, MAP7, TMCC1, HS.162734, MAP7, and PGA5.
In one embodiment the 44 gene signature comprises or consists of at least down-regulated genes HM13, BTN3A1, UGP2, CYB561, GBP6, CYB561, DUSP3, LOC196752, ALDH1A1, PRDM1, CERKL, HM13, RNF19A, MIR1974, PPPDE2, GJA9, CREB5, SERPING1, LOC389386, SEPT4, RBM12B, CALML4, LHFPL2, CASC1, C19ORF12, HLA-DPB1, CD74, ALDH1A1, AAK1, and LOC100133800.
In one embodiment the 44 gene signature comprises or consists of at least up-regulated genes ARG1, IMPA2, RP5-1022P6.2, ORM1, EBF1, PDK4, MAK, VPREB3, HS.131087, MAP7, TMCC1, HS.162734, MAP7, PGA5 and optionally down-regulated genes HM13, BTN3A1, UGP2, CYB561, GBP6, CYB561, DUSP3, LOC196752, ALDH1A1, PRDM1, CERKL, HM13, RNF19A, MIR1974, PPPDE2, GJA9, CREB5, SERPING1, LOC389386, SEPT4, RBM12B, CALML4, LHFPL2, CASC1, C19ORF12, HLA-DPB1, CD74, ALDH1A1, AAK1, and LOC100133800.
In one embodiment the 53 gene signature comprises or consists of at least up-regulated genes GNG7, BLK, OSBPL10, CXCR5, HEY1, COL9A2, SPIB, LOC90925, ILMN_1916292, EBF1, VPREB3, TMCC1, MAP7, PGA5, and ILMN_1893697.
In one embodiment the 53 gene signature comprises or consists of at least down-regulated genes UGP2, BTN3A1, DUSP3, GBP6, CALML4, FZD2, CYB561, LHFPL2, CYB561, CASC1, RNU4ATAC, VPS13B, PPPDE2, ALDH1A1, GBP5, GAS6, SEPT4, FCGR1B, POLB, CREB5, SIGLEC11, LOC389386, DEFA1B, LOC650546, FAM26F, FCGR1A, DEFA1B, ALDH1A1, ANKRD22, IFI27L2, DEFA1, MIR21, DEFA3, FCGR1C, UHMK1, CD74, IL15, and CREG1.
In one embodiment the 53 gene signature comprises or consists of at least up-regulated genes GNG7, BLK, OSBPL10, CXCR5, HEY1, COL9A2, SPIB, LOC90925, ILMN_1916292, EBF1, VPREB3, TMCC1, MAP7, PGA5, ILMN_1893697 and optionally down-regulated genes UGP2, BTN3A1, DUSP3, GBP6, CALML4, FZD2, CYB561, LHFPL2, CYB561, CASC1, RNU4ATAC, VPS13B, PPPDE2, ALDH1A1, GBP5, GAS6, SEPT4, FCGR1B, POLB, CREB5, SIGLEC11, LOC389386, DEFA1B, LOC650546, FAM26F, FCGR1A, DEFA1B, ALDH1A1, ANKRD22, IFI27L2, DEFA1, MIR21, DEFA3, FCGR1C, UHMK1, CD74, IL15, and CREG1.
In one embodiment the 27 and 44 gene signatures are tested in parallel.
In one embodiment the 27 and 53 gene signatures are tested in parallel.
In one embodiment the 44 and 53 gene signatures are tested in parallel.
In one embodiment the 27, 44 and 53 gene signatures are tested in parallel.
In one embodiment each of the genes in the 27, 44 and 53 gene signatures is significantly differentially expressed in the sample with active TB compared to a comparator group.
Significantly differentially expressed as employed herein means the sample with active TB shows a log2 fold change > 0.5.
In the 27 gene signature the comparator group is LTBI.
In the 44 gene signature the comparator group is a person with “other disease” (OD), that is a disease that is not active TB but has similar symptoms.
In the 53 gene signature group the comparator group is LTBI+OD. Thus the 53 gene signature is suitable for identifying active TB in the presence of any other complicating factor.
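The differential-expression criterion can be sketched in Python as follows. This is an illustration under stated assumptions rather than the analysis actually performed: the function names and intensity values are hypothetical, expression is taken on a linear scale, group means are compared, and the 0.5 threshold is applied to the magnitude of the log2 fold change so that down-regulated genes can also qualify.

```python
import math
from statistics import mean


def log2_fold_change(case_values, comparator_values):
    """log2 of the ratio of mean linear-scale expression, cases over comparators."""
    return math.log2(mean(case_values) / mean(comparator_values))


def is_significantly_different(case_values, comparator_values, threshold=0.5):
    """True when the magnitude of the log2 fold change exceeds the threshold.

    Using the magnitude (rather than the signed value) is an assumption.
    """
    return abs(log2_fold_change(case_values, comparator_values)) > threshold


# Made-up intensities for three hypothetical transcripts: active TB versus LTBI.
higher_in_tb = ([400, 420, 380], [200, 190, 210])   # mean 400 versus mean 200
unchanged = ([100, 110, 90], [95, 105, 100])        # mean 100 versus mean 100
lower_in_tb = ([50, 50, 50], [200, 200, 200])       # mean 50 versus mean 200

for name, (tb, ltbi) in [("higher", higher_in_tb),
                         ("unchanged", unchanged),
                         ("lower", lower_in_tb)]:
    change = log2_fold_change(tb, ltbi)
    print(name, round(change, 2), is_significantly_different(tb, ltbi))
```

For the 44 gene signature the comparator values would come from the OD group, and for the 53 gene signature from the combined LTBI and OD groups.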
“Presented in the form of” as employed herein refers to the laying down of genes from one or more of the signatures in the form of probes on a microarray.
Accurately and robustly as employed herein refers to the fact that the method can be employed in a practical setting, such as Africa, and that the results of performing the method properly give a high level of confidence that a true result is obtained.
High confidence is provided by the method when it provides few results that are false positives (i.e. the result suggests that the subject has active TB when they do not) and also has few false negatives (i.e. the result suggests that the subject does not have active TB when they do).
High confidence would include 90% or greater confidence, such as 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% confidence when an appropriate statistical test is employed.
In one embodiment the method provides a sensitivity of 80% or greater such as 90% or greater in particular 95% or greater, for example where the sensitivity is calculated as below:
In one embodiment the method provides a high level of specificity, for example 80% or greater such as 90% or greater in particular 95% or greater, for example where specificity is calculated as shown below:
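Assuming the standard definitions, sensitivity is the number of true positives divided by the sum of true positives and false negatives, and specificity is the number of true negatives divided by the sum of true negatives and false positives. The following Python sketch applies these definitions; the function names and the cohort of 20 subjects are hypothetical.

```python
def confusion_counts(has_active_tb, test_positive):
    """Count true positives, false negatives, true negatives and false positives."""
    tp = fn = tn = fp = 0
    for truth, call in zip(has_active_tb, test_positive):
        if truth and call:
            tp += 1      # active TB, correctly detected
        elif truth:
            fn += 1      # active TB, missed by the test
        elif call:
            fp += 1      # no active TB, wrongly flagged
        else:
            tn += 1      # no active TB, correctly cleared
    return tp, fn, tn, fp


def sensitivity(tp, fn):
    """Proportion of subjects with active TB that the test identifies."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Proportion of subjects without active TB that the test clears."""
    return tn / (tn + fp)


# Made-up cohort: 10 subjects with confirmed active TB and 10 without.
truth = [True] * 10 + [False] * 10
calls = [True] * 9 + [False] + [True] * 2 + [False] * 8

tp, fn, tn, fp = confusion_counts(truth, calls)
print(f"sensitivity = {sensitivity(tp, fn):.0%}, specificity = {specificity(tn, fp):.0%}")
```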
In one embodiment the sensitivity of the method of the 27 gene signature is 83 to 100%, such as 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98 or 99%.
In one embodiment the specificity of the method of the 27 gene signature is 75 to 100%, such as 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98 or 99%.
In one embodiment the sensitivity of the method of the 44 gene signature is 77 to 100%, such as 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98 or 99%.
In one embodiment the specificity of the method of the 44 gene signature is 68 to 100%, such as 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98 or 99%.
There are a number of ways in which gene expression can be measured including microarrays, tiling arrays, DNA or RNA arrays for example on gene chips, RNA-seq and serial analysis of gene expression.
Any suitable method of measuring gene modulation may be employed in the method of the present disclosure.
In one embodiment the gene expression data is generated from a microarray, such as a gene chip.
Microarray as employed herein includes RNA or DNA arrays, such as RNA arrays.
A gene chip is essentially a microarray, that is to say an array of discrete regions, typically nucleic acids, which are separate from one another and are, for example, arrayed at a density of between about 100/cm² and 1000/cm², but can be arrayed at greater densities such as 10000/cm².
The principle of a microarray experiment is that mRNA from a given cell line or tissue is used to generate a labelled sample, typically labelled cDNA or cRNA, termed the ‘target’, which is hybridised in parallel to a large number of nucleic acid sequences, typically DNA or RNA sequences, immobilised on a solid surface in an ordered array. Tens of thousands of transcript species can be detected and quantified simultaneously. Although many different microarray systems have been developed, the most commonly used systems today can be divided into two groups.
Using this technique, arrays consisting of more than 30,000 cDNAs can be fitted onto the surface of a conventional microscope slide. For oligonucleotide arrays, short 20-25mers are synthesised in situ, either by photolithography onto silicon wafers (high-density-oligonucleotide arrays from Affymetrix) or by ink-jet technology (developed by Rosetta Inpharmatics and licensed to Agilent Technologies).
Alternatively, pre-synthesised oligonucleotides can be printed onto glass slides. Methods based on synthetic oligonucleotides offer the advantage that because sequence information alone is sufficient to generate the DNA to be arrayed, no time-consuming handling of cDNA resources is required. Also, probes can be designed to represent the most unique part of a given transcript, making the detection of closely related genes or splice variants possible. Although short oligonucleotides may result in less specific hybridization and reduced sensitivity, the arraying of pre-synthesised longer oligonucleotides (50-100mers) has recently been developed to counteract these disadvantages.
In one embodiment the gene chip is an off the shelf, commercially available chip, for example the HumanHT-12 v4 Expression BeadChip Kit available from Illumina, NimbleGen microarrays from Roche, microarrays from Agilent or Eppendorf, and GeneChips from Affymetrix such as HG-U133 Plus 2.0 gene chips.
In an alternate embodiment the gene chip employed in the present invention is a bespoke gene chip, that is to say the chip contains only the target genes which are relevant to the desired profile. Custom made chips can be purchased from companies such as Roche, Affymetrix and the like. In yet a further embodiment the bespoke gene chip comprises a minimal disease specific transcript set.
In one embodiment the chip comprises or consists of 60-100% of the 27 genes listed in Table 3.
In one embodiment the chip comprises or consists of 60-100% of the 44 genes listed in Table 4.
In one embodiment the chip comprises or consists of 60-100% of the 53 genes listed in Table 5.
In one embodiment the chip comprises or consists of 60-100% of the 27 genes listed in Table 3 in combination with 60-100% of the 44 genes listed in Table 4.
In one embodiment the chip comprises or consists of 60-100% of the 27 genes listed in Table 3 in combination with 60-100% of the 53 genes listed in Table 5.
In one embodiment the chip comprises or consists of 60-100% of the 44 genes listed in Table 4 in combination with 60-100% of the 53 genes listed in Table 5.
In one embodiment the chip comprises or consists of 60-100% of the 27 genes listed in Table 3 in combination with 60-100% of the 44 genes listed in Table 4 and 60-100% of the 53 genes listed in Table 5.
In one or more embodiments above the chip may further include 1 or more, such as 1 to 10, house-keeping genes.
In one embodiment the gene expression data is generated in solution using appropriate probes for the relevant genes.
Probe as employed herein is intended to refer to a hybridisation probe which is a fragment of DNA or RNA of variable length (usually 100-1000 bases long) which is used in DNA or RNA samples to detect the presence of nucleotide sequences (the DNA target) that are complementary to the sequence in the probe. The probe thereby hybridises to single-stranded nucleic acid (DNA or RNA) whose base sequence allows probe-target base pairing due to complementarity between the probe and target.
In one embodiment the method according to the present disclosure and for example chips employed therein may comprise one or more house-keeping genes. House-keeping genes as employed herein is intended to refer to genes that are not directly relevant to the profile for identifying the disease or infection but are useful for statistical purposes and/or quality control purposes, for example they may assist with normalising the data, in particular a house-keeping gene is a constitutive gene i.e. one that is transcribed at a relatively constant level. The housekeeping gene's products are typically needed for maintenance of the cell. Examples include actin, GAPDH and ubiquitin.
In one embodiment minimal disease specific transcript set as employed herein means the minimum number of genes needed to robustly identify the target disease state.
Minimal discriminatory gene set is interchangeable with minimal disease specific transcript set.
Normalising as employed herein is intended to refer to statistically accounting for background noise by comparison of data to control data, such as the level of fluorescence of house-keeping genes. For example, fluorescent scanned data may be normalised using RMA (robust multi-array average) to allow comparisons between individual chips. Irizarry et al 2003 describes this method.
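Normalisation against house-keeping genes can be illustrated with a simple Python sketch. It is deliberately much simpler than RMA and is not a substitute for it: the sketch assumes log2-scale intensities and subtracts the mean house-keeping signal of each chip, and the function name, gene selection and intensity values are hypothetical.

```python
from statistics import mean


def normalise_to_housekeeping(log2_intensities, housekeeping_genes):
    """Express each signature gene relative to the chip's mean house-keeping signal.

    log2_intensities:   gene -> log2 fluorescence intensity on one chip
    housekeeping_genes: genes assumed to be transcribed at a relatively constant level
    """
    reference = mean(log2_intensities[gene] for gene in housekeeping_genes)
    return {
        gene: value - reference
        for gene, value in log2_intensities.items()
        if gene not in housekeeping_genes
    }


housekeeping = {"GAPDH", "ACTB"}  # illustrative choice (GAPDH and an actin gene)

# Made-up data: chip_2 is uniformly brighter than chip_1 by one log2 unit.
chip_1 = {"GAPDH": 12.0, "ACTB": 13.0, "DUSP3": 9.5, "GAS6": 8.0}
chip_2 = {"GAPDH": 13.0, "ACTB": 14.0, "DUSP3": 10.5, "GAS6": 9.0}

norm_1 = normalise_to_housekeeping(chip_1, housekeeping)
norm_2 = normalise_to_housekeeping(chip_2, housekeeping)
print(norm_1)
print(norm_2)
```

In this made-up case the chip-wide brightness difference cancels, so the two chips give the same normalised values and can be compared directly.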
Scaling as employed herein refers to boosting the contribution of specific genes which are expressed at low levels or have a high fold change but still relatively low fluorescence such that their contribution to the diagnostic signature is increased.
Fold change is often used in analysis of gene expression data in microarray and RNA-Seq experiments for measuring change in the expression level of a gene. It is calculated simply as the ratio of the final value to the initial value, i.e. if the initial value is A and the final value is B, the fold change is B/A (Tusher et al 2001).
In programs such as Arrayminer, fold change of gene expression can be calculated. The statistical value attached to the fold change is calculated and is the more significant in genes where the level of expression is less variable between subjects in different groups and, for example where the difference between groups is larger.
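The dependence of the statistical value on variability and on the size of the difference can be illustrated with a Welch t-statistic. This is an assumed stand-in, since the statistic used by Arrayminer is not specified here; the function name and the log2 expression values below are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch t-statistic: difference in means divided by its standard error."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Made-up log2 expression values. Both genes differ by 1.0 between the groups,
# but the first gene varies far less between subjects than the second.
tight_a, tight_b = [10.0, 10.1, 9.9], [9.0, 9.1, 8.9]
noisy_a, noisy_b = [10.0, 11.0, 9.0], [9.0, 10.0, 8.0]

t_tight = welch_t(tight_a, tight_b)
t_noisy = welch_t(noisy_a, noisy_b)
print(f"low variability: t = {t_tight:.2f}; high variability: t = {t_noisy:.2f}")
```

With the same difference between group means, the gene with less variability between subjects yields the larger statistic, which matches the behaviour described above.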
In one embodiment the subject is an adult. Adult is defined herein as a person of 18 years of age or older.
In one embodiment the subject is a child. Child as employed herein refers to a person under the age of 18, such as 5 to 17 years of age.
The step of obtaining a suitable sample from the subject is a routine technique, which involves taking a blood sample. This process presents little risk to donors and does not need to be performed by a doctor but can be performed by appropriately trained support staff. In one embodiment the sample derived from the subject is approximately 2.5 ml of blood; however, smaller volumes can be used, for example 0.5-1 ml.
Blood or other tissue fluids are immediately placed in an RNA stabilizing buffer, such as that included in PAXgene tubes or Tempus tubes.
If storage is required then the sample should usually be frozen at −80° C. within 3 hours of collection.
In one embodiment the gene expression data is generated from RNA levels in the sample.
For microarray analysis the blood may be processed using a suitable product, such as PAXgene blood RNA extraction kits (Qiagen).
Total RNA may also be purified using the Tripure method—Tripure extraction (Roche Cat. No. 1 667 165). The manufacturer's protocols may be followed. This purification may then be followed by the use of an RNeasy Mini kit—clean-up protocol with DNAse treatment (Qiagen Cat. No. 74106).
Quantification of RNA may be completed using optical density at 260 nm and the Quant-IT RiboGreen RNA assay kit (Invitrogen, Molecular Probes, RI 1490). The quality of the 28S and 18S ribosomal RNA peaks can be assessed by use of the Agilent bioanalyser.
In another embodiment the method further comprises the step of amplifying the RNA. Amplification may be performed using a suitable kit, for example TotalPrep RNA Amplification kits (Applied Biosystems).
In one embodiment an amplification method may be used in conjunction with the labelling of the RNA for microarray analysis, for example using the Nugen 3′ Ovation biotin kit (Cat: 2300-12, 2300-60).
The RNA derived from the subject sample is then hybridised to the relevant probes, which may, for example, be located on a chip. After hybridisation and washing, where appropriate, analysis with an appropriate instrument is performed.
In performing an analysis to ascertain whether a subject presents a gene signature indicative of disease or infection according to the present disclosure, the following steps are performed: obtain mRNA from the sample and prepare nucleic acid targets; hybridise to the array under appropriate conditions, typically as suggested by the manufacturers of the microarray (suitably stringent hybridisation conditions such as 3×SSC, 0.1% SDS, at 50° C.), to bind corresponding probes on the array; wash if necessary to remove unbound nucleic acid targets; and analyse the results.
In one embodiment the readout from the analysis is fluorescence.
In one embodiment the readout from the analysis is colorimetric.
In one embodiment physical detection methods, such as changes in electrical impedance, nanowire technology or microfluidics may be used.
In one embodiment there is provided a method which further comprises the step of quantifying RNA from the subject sample.
If a quality control step is desired, software such as Genome Studio software may be employed.
Numeric value as employed herein is intended to refer to a number obtained for each relevant gene, from the analysis or readout of the gene expression, for example the fluorescence or colorimetric analysis. The numeric value obtained from the initial analysis may be manipulated or corrected, and if the result of the processing is still a number then it will continue to be a numeric value.
By converting is meant processing of a negative numeric value to make it into a positive value or processing of a positive numeric value to make it into a negative value by simple conversion of a positive sign to a negative or vice versa.
Analysis of the subject-derived sample will, for the genes analysed, give a range of numeric values, some of which are positive (preceded by + and in mathematical terms considered greater than zero) and some of which are negative (preceded by − and in strict mathematical terms considered to be less than zero). The positive and negative signs in the context of gene expression analysis are a convenient mechanism for representing genes which are up-regulated and genes which are down-regulated.
In the method of the present disclosure either all the numeric values of genes which are down-regulated and represented by a negative number are converted to the corresponding positive number (i.e. by simply changing the sign) for example −1 would be converted to 1 or all the positive numeric values for the up-regulated genes are converted to the corresponding negative number.
The present inventors have established that this step of rendering the numeric values for the gene expressions positive or alternatively all negative allows the summating of the values to obtain a single value that is indicative of the presence of disease or infection or the absence of the same.
This is a huge simplification of the processing of gene expression data and represents a practical step forward thereby rendering the method suitable for routine use in the clinic.
By discriminatory power is meant the ability to distinguish between a TB infected and a non-infected sample (subject) or between active TB infection and other infections (such as HIV) in particular those with similar symptoms or between a latent infection and an active infection.
The discriminatory power of the method according to the present disclosure may, for example, be increased by attaching greater weighting to genes which are more significant in the signature, even if they are expressed at low or lower absolute levels.
As employed herein, raw numeric value is intended to, for example refer to unprocessed fluorescent values from the gene chip, either absolute fluorescence or relative to a house keeping gene or genes.
Summating as employed herein is intended to refer to the act or process of adding numerical values.
Composite expression score as employed herein means the sum (aggregate number) of all the individual numerical values generated for the relevant genes by the analysis, for example the sum of the fluorescence data for all the relevant up and down regulated genes. The score may or may not be normalised and/or scaled and/or weighted.
In one embodiment the composite expression score is normalised.
In one embodiment the composite expression score is scaled.
In one embodiment the composite expression score is weighted.
Weighted or statistically weighted as employed herein is intended to refer to the relevant value being adjusted to more appropriately reflect its contribution to the signature.
In one embodiment the method employs a simplified risk score as employed in the examples herein.
Simplified risk score is also known as disease risk score (DRS).
Control as employed herein is intended to refer to a positive (control) sample and/or a negative (control) sample which, for example is used to compare the subject sample to, and/or a numerical value or numerical range which has been defined to allow the subject sample to be designated as positive or negative for disease/infection by reference thereto.
Positive control sample as employed herein is a sample known to be positive for the pathogen or disease in relation to which the analysis is being performed, such as active TB.
Negative control sample as employed herein is intended to refer to a sample known to be negative for the pathogen or disease in relation to which the analysis is being performed.
In one embodiment the control is a sample, for example a positive control sample or a negative control sample, such as a negative control sample.
In one embodiment the control is a numerical value, such as a numerical range, for example a statistically determined range obtained from an adequate sample size defining the cut-offs for accurate distinction of disease cases from controls.
Conversion of multi-gene transcript disease signatures into a single number disease score
Once the RNA expression signature of the disease has been identified by variable selection, the transcripts are separated based on their up- or down-regulation relative to the comparator group. The two groups of transcripts are selected and collated separately.
To identify the single disease risk score for any individual patient, the raw intensities, for example fluorescent intensities (either absolute or relative to housekeeping standards), of all the up-regulated RNA transcripts associated with the disease are summated. Similarly, summation of all down-regulated transcripts for each individual is achieved by combining the raw values (for example fluorescence) for each transcript relative to the unchanged housekeeping gene standards. Since the transcripts have various levels of expression, and their fold changes differ as well, instead of summing the raw expression values they can be scaled and normalised to the range 0 to 1. Alternatively they can be weighted to allow important genes to carry greater effect. Then, for every sample the expression values of the signature's transcripts are summated, separately for the up- and down-regulated transcripts.
The total disease score incorporating the summated fluorescence of up- and down-regulated genes is calculated by adding the summated score of the down-regulated transcripts (after conversion to a positive number) to the summated score of the up-regulated transcripts, to give a single number composite expression score. This score maximally distinguishes the cases and controls and reflects the contribution of the up- and down-regulated transcripts to this distinction.
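The sign conversion and summation described above can be sketched as follows. This is a minimal illustration only: the function name and the example values are hypothetical, and the sketch assumes that down-regulated transcripts are represented by negative numeric values, as described earlier in this specification.

```python
def composite_expression_score(up_values, down_values):
    """Sum the up-regulated values and the sign-converted down-regulated values.

    up_values:   numeric values for the up-regulated transcripts (positive)
    down_values: numeric values for the down-regulated transcripts (negative)
    """
    up_total = sum(up_values)
    # Convert each negative value to the corresponding positive number
    # (simple change of sign) before summating.
    down_total = sum(abs(value) for value in down_values)
    return up_total + down_total


# Hypothetical values for one subject.
up = [1.5, 2.0, 0.5]
down = [-1.0, -0.25]
score = composite_expression_score(up, down)  # 4.0 + 1.25 = 5.25
```

Because both groups contribute with the same sign after conversion, a subject whose up-regulated transcripts are strongly raised and whose down-regulated transcripts are strongly lowered obtains a large single number, which is then compared with the control value or range.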
The composite expression scores for patients and the comparator group may be compared, in order to derive the means and variance of the groups, from which statistical cut-offs are defined for accurate distinction of cases from controls. Using the disease subjects and comparator populations, sensitivities and specificities for the disease risk score may be calculated using, for example a Support Vector Machine and internal elastic net classification.
Disease risk score as employed herein is an indicator of the likelihood that a patient has active TB when comparing their composite expression score to the comparator group's composite expression score.
Development of the Disease Risk Score into a Simple Clinical Test for Disease Severity or Disease Risk Prediction
The approach outlined above in which complex RNA expression signatures of disease or disease processes are converted into a single score which predicts disease risk can be used to develop simple, cheap and clinically applicable tests for disease diagnosis or risk prediction.
The procedure is as follows: For tests based on differential gene expression between cases and controls (or between different categories of cases such as severity), the up- and down-regulated transcripts identified as relevant may be printed onto a suitable solid surface such as microarray slide, bead, tube or well.
Up-regulated transcripts may be co-located separately from down-regulated transcripts either in separate wells or separate tubes. A panel of unchanged housekeeping genes may also be printed separately for normalisation of the results.
RNA recovered from individual patients using standard recovery and quantification methods (with or without amplification) is hybridised to the pools of up- and down-regulated transcripts and the unchanged housekeeping transcripts.
Control RNA is hybridised in parallel to the same pools of up- or down-regulated transcripts.
Total value, for example fluorescence for the subject sample and optionally the control sample is then read for up- and down-regulated transcripts and the results combined to give a composite expression score for patients and controls, which is/are then compared with a reference range of a suitable number of healthy controls or comparator subjects.
The details above explain how a complex signature of many transcripts can be reduced to the minimum set that is maximally able to distinguish between patients and other phenotypes. For example, within the up-regulated transcript set, there will be some transcripts that have a total level of expression many fold lower than that of others. However, these transcripts may be highly discriminatory despite their overall low level of expression. The weighting derived from the elastic net coefficient can be included in the test in a number of different ways. Firstly, the number of copies of individual transcripts included in the assay can be varied. Secondly, in order to ensure that the signal from rare, important transcripts is not swamped by that from transcripts expressed at a higher level, one option would be to select probes for a test that are neither overly strongly nor too weakly expressed, so that the contribution of multiple probes is maximised. Alternatively, it may be possible to adjust the signal from low-abundance transcripts by a scaling factor.
Whilst this can be done at the analysis stage using current transcriptomic technology as each signal is measured separately, in a simple colorimetric test only the total colour change will be measured, and it would not therefore be possible to scale the signal from selected transcripts. This problem can be circumvented by reversing the chemistry usually associated with arrays. In conventional array chemistry, the probes are coupled to a solid surface, and the amount of biotin-labelled, patient-derived target that binds is measured. Instead, we propose coupling the biotin-labelled cRNA derived from the patient to an avidin-coated surface, and then adding DNA probes coupled to a chromogenic enzyme via an adaptor system. At the design and manufacturing stage, probes for low-abundance but important transcripts are coupled to greater numbers, or more potent forms, of the chromogenic enzyme, allowing the signal for these transcripts to be ‘scaled-up’ within the final single-channel colorimetric readout. This approach would be used to normalise the relative input from each probe in the up-regulated, down-regulated and housekeeping channels of the kit, so that each probe makes an appropriately weighted contribution to the final reading, which may take account of its discriminatory power, as suggested by the weights of variable selection methods.
The detection system for measuring multiple up- or down-regulated genes may also be adapted to use RT-PCR to detect the transcripts comprising the diagnostic signature, with summation of the separate pooled values for up- and down-regulated transcripts, or physical detection methods such as changes in electrical impedance. In this approach, the transcripts in question are printed on nanowire surfaces or within microfluidic cartridges, and binding of the corresponding ligand for each transcript is detected by changes in impedance or other physical detection system.
The present disclosure extends to a custom made chip comprising a minimal discriminatory gene set for diagnosis of active TB from other conditions, in particular those with similar symptoms, for example comprising at least 60-100% of the 27 genes listed in Table 3, and/or 60-100% of the 44 genes listed in Table 4, and/or 60-100% of the 53 genes listed in Table 5.
In one embodiment the gene chip is a fluorescent gene chip that is to say the readout is fluorescence.
Fluorescence as employed herein refers to the emission of light by a substance that has absorbed light or other electromagnetic radiation.
In an alternative embodiment the gene chip is a colorimetric gene chip. For example, a colorimetric gene chip uses microarray technology wherein avidin is used to attach enzymes such as peroxidase or other chromogenic substrates to the biotin probe currently used to attach fluorescent markers to DNA. The present disclosure extends to a microarray chip adapted to be read by colorimetric analysis and adapted for the analysis of active TB infection in a patient. The present disclosure also extends to use of a colorimetric chip to analyse a subject sample for active TB infection.
Colorimetric as employed herein refers to an assay wherein the output is in the human visible spectrum.
In an alternative embodiment, a gene set indicative of active TB may be detected by physical detection methods including nanowire technology, changes in electrical impedance, or microfluidics.
The readout for the assay can be converted from a fluorescent readout as used in current microarray technology into a simple colorimetric format or one using physical detection methods such as changes in impedance, which can be read with minimal equipment. For example, this is achieved by utilising the Biotin currently used to attach fluorescent markers to DNA. Biotin has high affinity for avidin which can be used to attach enzymes such as peroxidase or other chromogenic substrates. This process will allow the quantity of cRNA binding to the target transcripts to be quantified using a chromogenic process rather than fluorescence. Simplified assays providing yes/no indications of disease status can then be developed by comparison of the colour intensity of the up- and down-regulated pools of transcripts with control colour standards. Similar approaches can enable detection of multiple gene signatures using physical methods such as changes in electrical impedance.
This aspect of the invention is likely to be particularly advantageous for use in remote or under-resourced settings, for example places in Africa, or for rapid diagnosis in “near patient” tests, because the equipment required to read the chip is likely to be simpler.
Multiplex assay as employed herein refers to a type of assay that simultaneously measures several analytes (often dozens or more) in a single run/cycle of the assay. It is distinguished from procedures that measure one analyte at a time.
In one embodiment there is provided a bespoke gene chip for use in the method, in particular as described herein.
In one embodiment there is provided use of a known gene chip for use in the method described herein in particular to identify one or more gene signatures described herein.
In one embodiment there is provided a method of treating latent TB after diagnosis employing the method disclosed herein.
In one embodiment there is provided a method of treating active TB after diagnosis employing the method disclosed herein.
Gene signature, gene set, disease signature, diagnostic signature and gene profile are used interchangeably throughout and should be interpreted to mean gene signature.
In the context of this specification “comprising” is to be interpreted as “including”.
Aspects of the invention comprising certain elements are also intended to extend to alternative embodiments “consisting” or “consisting essentially” of the relevant elements.
Where technically appropriate, embodiments of the invention may be combined.
Embodiments are described herein as comprising certain features/elements. The disclosure also extends to separate embodiments consisting or consisting essentially of said features/elements.
Technical references such as patents and applications are incorporated herein by reference.
Any embodiments specifically and explicitly recited herein may form the basis of a disclaimer either alone or in combination with one or more further embodiments.
The overall plan of the study is shown in
Cape Town, South Africa (SA):
SA has one of the highest TB incidence rates in Africa (981 per 100,000), as well as high rates of HIV infection (up to 41.8% prevalence in females aged 25-35). Patients undergoing investigation for suspected TB were recruited at GF Jooste Hospital Manenberg, Groote Schuur Hospital and at Khayelitsha site B, clinics serving the largely Xhosa population residing in the low income townships of Cape Town. Malaria is not endemic in these urban populations.
Karonga, Northern Malawi:
The incidence of new tuberculosis cases in Karonga district (180 per 100,000, Karonga Prevention Study unpublished data 2012) and the stable HIV prevalence (10-15% of females aged 25-29, Karonga Prevention Study unpublished data 2012) are lower in Karonga than Cape Town, and malaria and helminth infection are hyperendemic (that is, there is a high and continued incidence of disease). Patients were recruited at Karonga District hospital which serves a rural population living by the shores of Lake Malawi.
To ensure accurate assignment of patients to definite TB and OD groups, a rigorous diagnostic process was followed. All patients underwent chest radiographs and serological testing for HIV, along with cultures of blood, CSF and urine, and biopsies for histological examination including TB culture where clinically indicated. Two sputum samples obtained after induction or coughing were examined by standard microscopy for acid fast bacilli (AFB) and cultured for TB using standard methods (Crampin et al 2001). Patients were followed up 26 weeks post diagnosis to confirm that those with other diseases remained TB-free. Healthy LTBI controls were recruited by random community selection (Malawi) and from HIV screening clinics (SA) from the same catchment areas as patients with TB (
Following the diagnostic work-up, patients were assigned to groups using the following definitions (
Definite TB case (TB): a participant with a clinical condition consistent with tuberculosis, and mycobacteria confirmed to be M.TB complex cultured from sputum or tissue samples. Confirmation of mycobacterial species was undertaken by Gen-Probe assay (Roche).
Latent TB infected case (LTBI): a participant who is clinically assessed as healthy and not suffering from a clinical syndrome in which tuberculosis is likely. The individuals will have a TST of 10 mm or more if HIV-uninfected, or 5 mm or more if HIV-infected and a positive IGRA and negative sputum culture. Sputum was only collected if the cough was productive, when at least two samples were collected. LTBI criteria were relaxed in the second year of the study to allow a positive TST and/or a positive IGRA to facilitate recruitment in Malawi. This change was made prior to any RNA expression measurements.
Other disease case (OD): A participant with a disease syndrome that on presentation includes tuberculosis in the differential diagnosis, but following clinical investigation and management, tuberculosis was excluded and a firm alternative diagnosis established.
Between January 2007 and June 2011, we recruited patients with suspected TB or other diseases (OD) in which the assessing clinician considered TB to be within the differential diagnosis. All patients underwent chest radiographs and serological testing for HIV, TST, cultures of blood, CSF and urine, and biopsies for histological examination (including TB culture where clinically indicated). Two sputum samples obtained after induction or coughing (Crampin et al 2001) were examined by standard microscopy for acid fast bacilli (AFB) and cultured for TB. Confirmation of mycobacterial species was undertaken by Gen-Probe assay (Roche). Patients were followed to confirm that those with OD remained TB-free for 26 weeks post diagnosis.
Healthy LTBI controls were recruited by random community selection (Malawi) and from HIV screening clinics (SA) from the same catchment areas as TB cases. In vitro IGRA to substantiate LTBI was undertaken using an in-house whole blood assay (Hussain et al 2002) (ESAT6 and CFP10 (Franken et al 2000) antigens supplied by THO). A rigorous diagnostic process and group definitions were implemented to ensure accurate assignment to TB, LTBI and OD groups (
The study was approved by the Human Research Ethics Committee of the University of Cape Town, South Africa (HREC012/2007), the National Health Sciences Research Committee, Malawi (NHSRC/447), and the Ethics Committee of the London School of Hygiene and Tropical Medicine (5212). Written information was provided by trained local health workers in local languages and all patients provided written consent.
Patients were recruited to the study by local health care workers. Assignment of patients to clinical groups was made by consensus of experienced clinicians at each site (independent of those managing the patient clinically) after review of the investigation results. Testing for HIV status was conducted after appropriate counselling. Clinical data was anonymised and patient samples were identified only by study number. Microarrays were conducted by laboratory personnel blinded to assigned patient diagnostic groups. Statistical analysis was conducted only after the RNA expression data and clinical databases had been locked and deposited for independent verification.
Whole blood was collected at the time of recruitment (either before or within 24 hours of commencing TB treatment in suspected cases) in PAXgene® tubes, frozen within 3 hours of collection and later extracted using PAXgene® blood RNA extraction kits (Qiagen). RNA was shipped frozen to the Genome Institute of Singapore for analysis on HumanHT-12 v4 Expression BeadChips (Illumina).
Whole blood (2.5 ml) was collected into PAXgene™ blood RNA tubes (PreAnalytiX, Germany), incubated for 2 hours, frozen at −20° C. within 3 hours of collection, and then stored at −80° C. RNA was extracted using PAXgene™ blood RNA kits (PreAnalytiX, Germany) according to the manufacturer's instructions at one site (Cape Town) to minimize any sample handling bias. The integrity and yield of the total RNA was assessed using an Agilent 2100 Bioanalyser and a NanoDrop 1000 spectrophotometer respectively. Total RNA was then shipped to the Genome Institute of Singapore. After quantification and quality control, biotin-labelled cRNA was prepared using Illumina TotalPrep RNA Amplification kits (Applied Biosystems) from 500 ng RNA. Labelled cRNA was hybridized overnight to Human HT-12 V4 Expression BeadChip arrays (Illumina). After washing, blocking and staining, the arrays were scanned using an Illumina BeadArray Reader according to the manufacturer's instructions. Using Genome Studio software the microarray images were inspected for artefacts and QC parameters were assessed. No arrays were excluded at this stage.
Expression data were analysed using the ‘R’ Language and Environment for Statistical Computing (R) 2.12.1. To identify transcript signatures applicable across geographic locations and in patients with differing HIV status, we combined HIV-infected and -uninfected patient cohorts from SA and Malawi. The recruited subjects were randomly assigned to a “training” cohort (80% of the subjects) and a test cohort (20%) with no overlap. For additional validation we used the whole blood expression dataset of Berry et al. comparing TB with LTBI and other infections in a UK and an African cohort (accession GSE19491).
To detect transcripts that were differentially expressed between TB cases and comparator groups, a linear model was fitted and moderated t-statistics calculated for each transcript with correction for false discovery using Benjamini and Hochberg's method (1995). To identify the smallest number of transcripts distinguishing TB from the comparator groups, significantly differentially expressed (SDE) transcripts in the discovery cohort with a log2 fold change (FC) >0.5 were subjected to variable selection using elastic net. These minimal transcript selected sets for TB vs. LTBI, TB vs. OD and TB vs. LTBI+OD were assessed in the test cohort and further evaluated using independent datasets (Berry et al 2010).
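The false discovery correction and fold-change filter described above can be sketched as follows. This is an illustration of the Benjamini and Hochberg step-up adjustment applied to ordinary p-values, not of the moderated t-statistics themselves. The function names, the transcript identifiers, the significance level of 0.05 and the use of the absolute log2 fold change are all assumptions made for the example and are not stated in the text.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_candidates(names, p_values, log2_fc, alpha=0.05, min_abs_lfc=0.5):
    """Keep transcripts passing both the adjusted p-value and the FC filter."""
    adjusted = benjamini_hochberg(p_values)
    return [
        name
        for name, adj_p, lfc in zip(names, adjusted, log2_fc)
        if adj_p < alpha and abs(lfc) > min_abs_lfc
    ]


# Hypothetical transcripts, p-values and log2 fold changes.
names = ["t1", "t2", "t3", "t4"]
p_vals = [0.01, 0.04, 0.03, 0.005]
lfcs = [1.2, 0.3, -0.9, 0.7]
candidates = select_candidates(names, p_vals, lfcs)
```

The transcripts returned by such a filter correspond to the candidate set that is passed on to variable selection.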
Mean raw intensity values for each probe were corrected for local background intensities and a robust spline normalisation (combining quantile normalisation and spline interpolation) was applied to each array. Expression values for each probe were transformed to a logarithmic scale (base 2). Differential expression between patient groups was identified by fitting a linear model to each transcript using LIMMA. P-values were adjusted using the method of Benjamini and Hochberg. Transcripts with log2 FC >0.5 were taken forward to variable selection with elastic net. This threshold was chosen in order to ensure that differential expression for selected variables could be distinguished using the resolution of quantitative RT-PCR. The α and λ parameters of elastic net, which control the size of the selected model, were optimized via ten-fold cross-validation (CV). The weights assigned by elastic net to the trained model were used within a linear regression model to classify samples in the test set.
Current whole genome array-based technologies are not well suited for use in resource poor settings as they are costly and require sophisticated technology as well as bioinformatics expertise. We therefore developed a method for translation of multiple transcript RNA signatures into a disease risk score, which could form the basis of a simple, low cost, diagnostic test requiring basic laboratory facilities and minimal bioinformatics analysis. For each individual, we calculated the disease risk score using the minimal transcript selected sets for TB vs. LTBI, TB vs. OD and TB vs. LTBI+OD. The score is derived by adding the total intensity at up-regulated transcripts, and subtracting the total intensity at all down-regulated transcripts. The sensitivity and specificity of this score in disease classification was evaluated on test and validation cohorts.
Where μn is the mean of comparator group n, and σn is the standard deviation of comparator group n. The performance of the simplified risk score was then evaluated in our cohort as well as the independent datasets.
For each individual, we calculated the disease risk score using the minimal transcript selected sets for TB vs. LTBI, TB vs. OD and TB vs. LTBI+OD. The score is based on subtracting the summed intensities of the down-regulated transcripts from the summed intensities of the up-regulated transcripts. The risk score was calculated on normalised intensities. The disease risk score for individual i is:
where n is the number of up-regulated probes in the signature for the disease of interest compared to the comparator group(s).
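Consistent with the verbal definition given above (the summed intensities of the down-regulated transcripts subtracted from the summed intensities of the up-regulated transcripts), the score may be written in the following form. The symbols x (normalised intensity) and m (the number of down-regulated probes in the signature), and the indices k and l, are notation assumed here for illustration.

```latex
\mathrm{DRS}_i \;=\; \sum_{k=1}^{n} x_{ik}^{\mathrm{up}} \;-\; \sum_{l=1}^{m} x_{il}^{\mathrm{down}}
```

Here x_{ik}^{up} is the normalised intensity of the k-th up-regulated probe in individual i, and x_{il}^{down} is the normalised intensity of the l-th down-regulated probe in the same individual.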
The threshold for the classification was calculated as the weighted average of risk score within each class, with weights given as inverse of the standard deviation of the score within each class (1/sd1 and 1/sd2 respectively). The threshold for the classification between group u and v is shown below:
where: μ=average of the disease risk score in the group.
σ=standard deviation of the disease risk score in the group.
To calculate the indeterminate zone, we calculated lower and upper thresholds as the weighted average with weights given by w/sd1 and (1−w)/sd2 respectively, for variable 0.5<w≤1. When w=0.5 the formula is equivalent to the main threshold formula. ROCs were generated using pROC.
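The weighted-average threshold described above can be sketched as follows. The function names and example scores are hypothetical. The sketch assumes that group 1 and group 2 correspond to groups u and v, that the sample standard deviation is used, and that the lower and upper thresholds of the indeterminate zone are obtained by applying the weights (1−w) and w in turn; none of these details is confirmed by the text.

```python
from statistics import mean, stdev


def weighted_threshold(scores_u, scores_v, w=0.5):
    """Weighted average of the two group means.

    The weights are w/sd_u and (1 - w)/sd_v; with w = 0.5 this reduces to
    weighting each mean by the inverse of its group's standard deviation.
    """
    mu_u, mu_v = mean(scores_u), mean(scores_v)
    weight_u = w / stdev(scores_u)
    weight_v = (1 - w) / stdev(scores_v)
    return (weight_u * mu_u + weight_v * mu_v) / (weight_u + weight_v)


def indeterminate_zone(scores_u, scores_v, w=0.75):
    """Return (lower, upper) thresholds; the pairing of w and 1 - w is assumed."""
    a = weighted_threshold(scores_u, scores_v, w)
    b = weighted_threshold(scores_u, scores_v, 1 - w)
    return min(a, b), max(a, b)


# Hypothetical disease risk scores for cases (u) and comparators (v).
cases = [10.0, 12.0, 14.0]      # mean 12, sd 2
comparators = [2.0, 3.0, 4.0]   # mean 3, sd 1
main = weighted_threshold(cases, comparators)  # (3 + 1.5) / 0.75 = 6.0
lower, upper = indeterminate_zone(cases, comparators)
```

Because the weights are inverse standard deviations, the main threshold sits closer to the mean of the group whose scores are less spread out, which in this example is the comparator group.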
Alternatively:
To calculate the indeterminate zone, we calculated the lower and upper threshold which were calculated as the weighted average with weights given
respectively:
When w=1 the formula is equivalent to the main threshold formula.
To evaluate the performance of the DRS as a classifier we used different measures (AUC, sensitivity, specificity, PPV, NPV, and likelihood ratios).
The calculation of the confidence intervals for the area under a receiver operating characteristic curve (AUC), the sensitivity and the specificity was based on a non-parametric stratified bootstrap resampling (each replicate contained the same number of cases and controls as the original sample) (Robin et al 2011), with 2000 bootstraps, as recommended by Carpenter et al. (2000). We also employed the exact binomial (Clopper et al 1934) to calculate the confidence intervals (Table 9).
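The stratified bootstrap described above can be sketched as follows for the AUC. This is a minimal illustration under stated assumptions: the function names and example scores are hypothetical, the AUC is computed as the proportion of case-control pairs in which the case scores higher (ties counted as one half), and a simple percentile interval is taken from the bootstrap replicates.

```python
import random


def auc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control."""
    wins = sum(
        (c > k) + 0.5 * (c == k)
        for c in case_scores
        for k in control_scores
    )
    return wins / (len(case_scores) * len(control_scores))


def stratified_bootstrap_ci(case_scores, control_scores,
                            n_boot=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for the AUC.

    Cases and controls are resampled separately, so every replicate keeps
    the same number of cases and controls as the original sample.
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        boot_cases = rng.choices(case_scores, k=len(case_scores))
        boot_controls = rng.choices(control_scores, k=len(control_scores))
        replicates.append(auc(boot_cases, boot_controls))
    replicates.sort()
    lower = replicates[int((alpha / 2) * n_boot)]
    upper = replicates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical disease risk scores.
case_scores = [0.9, 0.8, 0.7, 0.6]
control_scores = [0.1, 0.2, 0.3, 0.65]
point_estimate = auc(case_scores, control_scores)  # 15/16 = 0.9375
ci_lower, ci_upper = stratified_bootstrap_ci(case_scores, control_scores)
```

The same resampling loop can be reused for sensitivity or specificity by replacing the statistic computed on each replicate.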
We used the estimated sensitivity and specificity to calculate the positive and negative predictive values (PPV and NPV) using the following formulas:
and interpreting the prevalence as “the probability before the test is carried out that the subject has the disease” as suggested by D. Altman (1994). In this case, we assumed a clinical setting, such as the one used to recruit samples in Malawi, in which approximately 58% of patients with suspected TB had culture confirmed TB (254 TB confirmed cases/437 patients with suspected TB), as well as calculating more conservative values assuming a prevalence of 20% (as a more typical proportion would be 15%-25% in quality controlled laboratories in primary care settings in high-burden countries in sub-Saharan Africa). PPV and NPV can be interpreted as the probability that a sample with a positive test has active TB, and the probability that a sample with a negative test result does not have active TB respectively, and as such represent the diagnostic value of a test (Table S5). We also report positive and negative likelihood ratios along with their confidence intervals employing the method described in (Simel et al 1991) (Table 2A, 2B).
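The dependence of the predictive values on prevalence can be illustrated with the standard relationships between sensitivity, specificity and prevalence. The sketch below is an assumption-laden example: the function names are hypothetical, the sensitivity and specificity of 0.90 are placeholder values rather than results of the study, and the two prevalences are those discussed above (254/437, approximately 58%, and 20%).

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given pre-test probability of disease."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR); assumes specificity is not 0 or 1."""
    return sensitivity / (1 - specificity), (1 - sensitivity) / specificity


# Placeholder test characteristics, not values from the study.
sens, spec = 0.90, 0.90
ppv_high, npv_high = predictive_values(sens, spec, 254 / 437)
ppv_low, npv_low = predictive_values(sens, spec, 0.20)
lr_pos, lr_neg = likelihood_ratios(sens, spec)
```

For a fixed sensitivity and specificity, lowering the prevalence lowers the PPV and raises the NPV, which is why the more conservative 20% setting is reported alongside the recruitment setting.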
Although the models suggested by elastic net were the smallest ones to provide us with the best classification, we wanted to further explore the performance of even smaller lists of transcripts. Instead of optimizing both the α and λ parameters of elastic net, which control the size of the selected model, via ten-fold cross-validation (CV), we fixed α=1, which corresponds to the lasso penalty and gives smaller models. Then, within the cross-validation step of choosing λ, we forced the penalty to be such that the error would remain within one standard deviation of the minimum error. This process resulted in 21 transcripts for the TB vs. LTBI comparison (12 overlapping with the 27 transcript signature) and 29 transcripts for the TB vs. OD comparison (14 overlapping with the 44 transcript signature). The smaller models had reduced sensitivity (6%-10% lower than the original models) while specificity remained the same (Table 11). When the DRS was calculated, sensitivity and specificity were 89% CI95%[78-97] and 89% CI95%[79-97] respectively for the TB vs. LTBI comparison, and 83% CI95%[69-93] and 88% CI95%[76-97] respectively for the TB vs. OD comparison.
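The selection of λ described above may be sketched as follows, purely as an illustration. The sketch assumes that cross-validation has already produced, for each candidate λ, a mean error and a spread estimate; the function name and all numbers are hypothetical. It then picks the largest λ (the strongest penalty, hence the smallest model) whose mean error stays within one spread unit of the minimum error.

```python
def select_lambda(lambdas, cv_mean, cv_spread):
    """Return (lambda at minimum CV error, largest lambda within one spread of it).

    cv_spread holds whichever spread estimate is used for the tolerance band
    (the text refers to one standard deviation of the minimum error).
    """
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    limit = cv_mean[best] + cv_spread[best]
    eligible = [lambdas[i] for i in range(len(lambdas)) if cv_mean[i] <= limit]
    return lambdas[best], max(eligible)

# Hypothetical cross-validation results, for illustration only.
lambdas = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5]
cv_mean = [0.120, 0.105, 0.100, 0.108, 0.118, 0.200]
cv_spread = [0.012, 0.011, 0.010, 0.011, 0.012, 0.020]

lambda_min, lambda_sparse = select_lambda(lambdas, cv_mean, cv_spread)
print(f"lambda at minimum error: {lambda_min}; sparser choice: {lambda_sparse}")
```

With a lasso penalty, a larger λ sets more coefficients to zero, so the second value corresponds to a shorter transcript list at the cost of a small, bounded increase in cross-validated error.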
We included 31 smear-negative patients with TB (with definite negative smear status) in the analysis of the adult cohort (7 TB HIV-uninfected and 24 TB HIV-infected). The TB/LTBI and the TB/OD DRSs were applied to these patients, using the LTBI and OD patients from the test set as controls and maintaining the same threshold. The performance of the TB/LTBI signature was comparable to the performance in the HIV-infected group, and the performance of the TB/OD signature was almost the same as in the larger smear-negative and smear-positive group. Confidence intervals for the sensitivity and specificity of smear-negative patients with TB were calculated using both the bootstrapping and the exact binomial method (Table 12). These confidence intervals overlapped the corresponding CIs for the larger smear-positive and smear-negative group.
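The exact binomial interval referred to above may be sketched as follows, assuming the usual Clopper-Pearson construction and solving for the bounds by bisection on the binomial distribution function. The function names are hypothetical, and the count of 21 detected cases is a hypothetical example chosen only for illustration; the text gives the number of patients (31) but not the number correctly classified.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))

def exact_binomial_ci(k, n, level=0.95):
    """Clopper-Pearson interval for k successes out of n trials."""
    alpha = 1.0 - level

    def solve(cdf_at, target):
        # cdf_at(p) decreases as p grows, so bisect for cdf_at(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2.0
            if cdf_at(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Lower bound: P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1.0 - alpha / 2.0)
    # Upper bound: P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2.0)
    return lower, upper

# Hypothetical example: 21 of 31 patients classified as TB.
ci_low, ci_high = exact_binomial_ci(21, 31)
print(f"sensitivity = {21 / 31:.2f}, exact 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

Unlike the bootstrap, this interval depends only on the two counts, which makes it convenient for small sub-groups.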
For validation of the performance of the disease risk score based on the TB vs. LTBI 27 transcript signature, TB vs. OD 44 transcript signature and TB vs. LTBI+OD 53 transcript signature, we used the whole blood expression dataset of Berry et al. generated using Illumina HT12 V3 Beadarrays comparing TB with LTBI and other infections in a UK and an African cohort (accession series GSE19491). For each testing dataset (UK GSE19444; SA GSE19442; OD GSE22098), both quantile and robust spline normalisation were applied separately to the arrays and the data were log transformed; the results were the same regardless of normalisation method.
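Of the two normalisation methods mentioned above, quantile normalisation may be sketched as follows; robust spline normalisation is not shown. The sketch assumes that the log2 transform is applied before normalisation and that ties are broken by position rather than averaged; both are assumptions made for illustration, as are the function name and the intensities.

```python
import math

def quantile_normalise(arrays):
    """Force every array (a list of probe intensities) onto the same distribution.

    The reference distribution is the mean, across arrays, of the r-th smallest
    value; each value is then replaced by the reference value of its rank.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(a[r] for a in sorted_arrays) / len(arrays) for r in range(n_probes)
    ]
    normalised = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])  # probe indices, low to high
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalised.append(out)
    return normalised

# Hypothetical raw intensities for two arrays and four probes.
raw = [[4.0, 16.0, 8.0, 64.0], [2.0, 32.0, 128.0, 8.0]]
logged = [[math.log2(v) for v in a] for a in raw]
normalised = quantile_normalise(logged)
print(normalised)
```

After normalisation every array contains the same set of values, differing only in which probe holds which value, so between-array intensity differences no longer dominate the comparison.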
For the evaluation of the performance of our TB vs. LTBI 27 transcript signature, we used TB and LTBI patients in both of the normalized testing sets (UK TB n=21, LTBI n=21; SA TB n=20, LTBI n=31). The probe ILMN_3247506 (FCGR1C) in the TB vs. LTBI signature was not on the HT12 V3 beadarray. For the evaluation of the performance of our TB vs. OD 44 transcript signature, we used TB patients from the normalized testing sets (UK testing TB n=21, SA TB n=20) and OD patients, excluding those with systemic lupus erythematosus as it was judged to be a rare disease in an African setting (n=82). The probes ILMN_3287952 (LOC100133800), ILMN_3215715 (LOC389386) and ILMN_3308961 (MIR1974) in the TB vs. OD signature were not on the HT12 V3 beadchip.
For testing the performance of the reported 393 transcript TB vs. LTBI signature and the 86 transcript TB vs. OD signature on our African dataset, the disease risk score was calculated with these signatures as previously described, although 7 probes in the reported signatures were not present on the HT-12 V4 Beadchip (TB vs. LTBI 6 probes, TB vs. OD 1 probe).
In order to compare directly the performance of our signatures with that of the signatures presented in Berry et al (2010), we calculated the differences of the means of the measures of classification (namely the AUC, the sensitivity and the specificity) on our test set along with their 95% confidence intervals, using the following mathematical formulas:
The RNA signatures distinguishing TB from OD and LTBI were analysed through the use of IPA (Ingenuity® Systems, www.ingenuity.com), which identifies pathways and functions overrepresented in the datasets.
We recruited 311 adult patients to the South African cohort and 273 to the Malawi cohort (
We performed quality control on the microarray data in order to examine the effect of disease state on the transcript expression and to check for assignment errors. Visual inspection revealed that the primary clustering was based on disease state (TB, LTBI, OD) rather than geographical location or HIV status (
To find minimal transcript sets required to discriminate TB from other groups we applied the variable selection algorithm elastic net to the training cohort. A 27 transcript model was identified for discriminating TB from LTBI in the Malawi/SA training and test set (
To evaluate the feasibility of using a simplified diagnostic test based on our transcript sets for TB diagnosis in low resource settings, we applied the disease risk score to our test cohort and to the UK and SA cohort data reported by Berry et al. In our combined HIV-infected and -uninfected test set, the 27 transcript disease risk score discriminated TB from LTBI with sensitivity and specificity of 95% and 90% respectively, whilst achieving perfect classification in the HIV-uninfected cohorts and slightly reduced accuracy in the HIV-infected cohorts (Table 2A,
In order to evaluate the classificatory power of the DRS, we compared its performance with the regression model derived from the elastic net based on the same signatures (Table 6). We found that our DRS had similar accuracy to the weighted regression model in distinguishing TB from LTBI and OD. In order to assess the predictive value of our DRS in a cohort of patients undergoing investigation for persistent symptoms such as cough, fever and weight loss, i.e. where TB was included in the differential diagnosis, we used the prevalence of TB in our prospective Malawi cohort (58%; 254 confirmed TB cases of 437 patients with suspected TB) to calculate the positive and negative predictive values (PPV/NPV). The DRS for TB vs. OD had a PPV of 92% CI95%[84-99] and an NPV of 90% CI95%[80-100] (Table 10). Using a 20% prevalence, which may be more reflective of a general primary care setting in a high-burden African country, the NPV for TB vs. OD is higher (98% CI95%[96-100]) but the PPV decreases (66% CI95%[46-87]), emphasizing the value of the DRS as a rule-out test, with those patients with a positive DRS selected for further investigation (Table 10).
We also explored the effect of adjusting the threshold for the DRS in assigning individual patients to TB or LTBI/OD. By accepting a percentage of patients as ‘non-classifiable’, the majority of patients under investigation are accurately assigned. These ‘non-classifiable’ patients could then be selected for more detailed investigation (
As it would be advantageous to have a single signature that distinguished TB from non-TB, we assessed the performance of a signature in distinguishing TB from both LTBI and OD. A 53 transcript signature was identified (Table 5) that distinguished TB from both LTBI and OD with sensitivity/specificity 91%/82%, a lower performance than the TB/LTBI and TB/OD signatures alone. We also explored whether a smaller number of transcripts could be used to distinguish TB from LTBI and from OD, which would aid in manufacturing of a test, resulting in a 21 and a 29 probe signature for distinguishing TB from LTBI and OD respectively. The sensitivity of the smaller models was 6%-10% lower than the original models, while retaining the same specificity (Table 11).
Our minimal transcript signatures were derived from prospectively recruited African cohorts of HIV-infected and -uninfected patients with TB, OD and LTBI. The previously reported signatures were derived only from HIV-uninfected patients, and from OD patients who were not recruited during a prospective evaluation of patients in whom TB was included in the differential diagnosis. To compare the two, we compared the performance of our 27 probe TB/LTBI signature and our 44 probe TB/OD signature with that of the signatures of Berry et al. for discrimination of TB vs. LTBI (393 transcripts) and TB vs. OD (86 transcripts). While the 393 transcript TB/LTBI signature achieved a sensitivity of 88% CI95%[80-94] and a specificity of 84% CI95%[76-92] on our TB HIV-uninfected cohorts, its sensitivity and specificity on the HIV-infected group were 74% CI95%[65-82] and 80% CI95%[71-87] respectively (Table 2B,
We evaluated the performance of our signatures in the smear-negative sub-group of patients with TB, the majority of whom were HIV-infected (31 smear-negative TB patients with definite negative smear status; 7 TB HIV-uninfected and 24 TB HIV-infected). In the smear-negative patients the DRS showed a sensitivity for detecting TB of 68% CI95%[52-84] when using the TB vs. LTBI signature and a sensitivity of 90% CI95%[81-100] with the TB/OD signature, both of which are comparable to results obtained in the larger HIV-infected cohort of smear-positive and -negative patients. As we used the same LTBI and OD patients from the test set, the specificity was unchanged (90% CI95%[80-97] for TB vs. LTBI and 88% CI95%[74-97] for TB vs. OD, Table 12).
Finally, we also tested the signatures of Berry et al. for discrimination of TB vs. LTBI (393 transcripts) and TB vs. OD (86 transcripts) on our cohorts using the disease risk score. While the TB vs. LTBI signature gave good classification on our TB HIV-uninfected cohorts (sensitivity 88%; specificity 84%), the performance on the HIV-infected group was less good (sensitivity 74%; specificity 80%) (Table 2B,
Initial assignment (using IPA) of the 27 probe set distinguishing TB from LTBI, the 44 probe set distinguishing TB from OD, and the 53 probe set distinguishing TB from non TB, revealed that genes comprising each signature formed highly significant networks of genes that were involved in the inflammatory response, cell-to-cell signalling and interaction, as well as dendritic cell maturation (
We have identified a host blood transcriptomic signature that distinguishes TB from a wide range of other conditions prevalent in HIV-infected and -uninfected Africans. We found that patients with TB can be distinguished from LTBI with only 27 transcripts, from OD with 44 transcripts and from LTBI and OD with 53 transcripts. Our finding appears robust as the results are reproducible in both HIV-infected and -uninfected cohorts, in different geographic locations, and in independent, publicly available datasets. The high sensitivity and specificity of our signatures in distinguishing TB from OD, even in HIV-infected patients who have differing levels of T cell depletion and a wide spectrum of opportunistic infections as well as HIV-related complications, suggest that the signatures are reliable markers of TB. The relatively small number of transcripts in our signatures suggests the potential to use RNA expression from a single peripheral blood sample as a clinical diagnostic tool (e.g. using a multiplex assay; Joosten et al 2012, Eldering et al 2003).
Our signatures and the disease risk score accurately distinguish the majority of patients who have TB from those with OD and/or LTBI in whom TB is excluded.
Our study provides proof of principle that diagnosis of active TB in African countries affected by the HIV/TB epidemic is feasible using RNA expression on peripheral blood.
1) 4 missing values.
2) 10 missing values.
3) 33 missing values, not routinely performed in the work up of TB+/HIV+ patients.
†(14) Bronchial carcinoma, (4) Lymphoma, (1) Cervical carcinoma, (1) Ovarian carcinoma, (1) mesothelioma, (1) gastric carcinoma, (4) metastatic carcinoma of unknown origin, (1) benign salivary tumour, (1) Dermatological tumour
Number | Date | Country | Kind |
---|---|---|---|
1213636.2 | Jul 2012 | GB | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2013/065887 | 7/29/2013 | WO | 00 |