This application claims priority from prior Japanese Patent Application No. 2018-163954, filed on Aug. 31, 2018, entitled “Analysis Method, Information Processing Apparatus, Gene Analysis System, Program, and Storage Medium”, the entire content of which is incorporated herein by reference.
The present invention relates to an analysis method, an information processing apparatus, and the like that analyze base sequences of genes.
Conventionally, technologies that analyze base sequences of genes have been utilized as important analysis techniques in the fields of basic study, clinical study, medical care, and the like. In recent years, panel tests using gene panels that allow comprehensive checking of abnormalities in genes of a subject (for example, a patient) by use of next generation sequencing (NGS) have been developed. Such panel tests are expected to play an important role in individualized medical care. Here, individualized medical care denotes medical care in which an appropriate therapeutic strategy is selected for each patient in consideration of characteristics such as the genetic background, the physiological condition, and the state of the disease of the patient.
Among technologies that analyze base sequences of genes, NGS is an indispensable technology for comprehensively detecting abnormalities in base sequences in genes. For example, “An introduction to Next-Generation Sequencing Technology”, [online], Illumina, Inc. [searched on Aug. 30, 2018], Internet <https://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf> describes a technique of simultaneously analyzing base sequences of genes derived from samples of a plurality of subjects by use of NGS.
When base sequences of genes of a plurality of subjects are simultaneously analyzed by use of NGS, the analysis is performed through steps I to V shown in
In step I, a sample A and a sample B are fragmented, and a library A of the sample A and a library B of the sample B are prepared. Here, the “sample A” may be genes derived from a tissue collected from a subject A, and the “sample B” may be genes derived from a tissue collected from a subject B, for example. In this step, adapter sequences are added to fragments of the sample A and the sample B. In this step, an index sequence 1 (“AAAAAAAA” in the drawing) is added to fragments (for example, DNA fragments) of the sample A, and an index sequence 2 (“BBBBBBBB” in the drawing) is added to fragments (for example, DNA fragments) of the sample B. The “adapter sequence” is an oligonucleotide that is added to each fragment in order to capture the fragment so that sequencing reaction is caused in a flow cell for a sequencer that performs sequencing. The “index sequence” is an oligonucleotide that has a length of several bases to several tens of bases and that is added to each fragment in order to distinguish sequence information derived from fragments of the sample A, from sequence information derived from fragments of the sample B in a later step IV.
Subsequently, in step II, the library A and the library B are mixed together, and the mixture is applied to a flow cell. In step III, sequencing reaction occurs on the flow cell, and sequence information is obtained. The obtained sequence information includes base sequence data of fragments of the sample A and base sequence data of fragments of the sample B.
Next, in step IV, sequence information is sorted on the basis of the index sequence included therein, and the sorted sequence information is stored in a file created for each sample. Then, in step V, sequence information is read out from each file, and alignment is performed for each of the sample A and the sample B.
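The sorting in step IV can be sketched as follows. This is a minimal illustration only, not the demultiplexing software of any sequencer: the function name `demultiplex`, the sample labels, and the assumption that the index sequence sits at the start of each read string are hypothetical simplifications. The index strings are the placeholders used in the drawing.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample, index_len=8):
    """Sort reads into per-sample bins by their leading index sequence.

    Assumption: each read string starts with its index sequence.
    Reads whose index is not registered go to an "undetermined" bin.
    """
    bins = defaultdict(list)
    for read in reads:
        index, insert = read[:index_len], read[index_len:]
        sample = index_to_sample.get(index)
        if sample is None:
            bins["undetermined"].append(read)
        else:
            bins[sample].append(insert)
    return dict(bins)


# Placeholder index sequences as written in the drawing.
index_to_sample = {"AAAAAAAA": "sample_A", "BBBBBBBB": "sample_B"}
reads = ["AAAAAAAAACGT", "BBBBBBBBTTGA", "AAAAAAAAGGCC", "NNNNNNNNACGT"]
sorted_reads = demultiplex(reads, index_to_sample)
```

In this sketch each bin plays the role of the file created for each sample, from which sequence information would be read out for alignment in step V.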
When base sequences of genes of a plurality of subjects are simultaneously analyzed by use of NGS, each step is determined on the basis of a protocol that is recommended for the sequencer to be used and the gene panel to be used. In addition, various reference values determined on the basis of the recommended protocol are set for an existing analysis program that is to be used in the analysis of the base sequences of genes.
For example, a suitable protocol is recommended in accordance with the specification of a flow cell that suits the sequencer to be used, and in accordance with the amounts and the like of a primer and a probe included in the gene panel. Therefore, in steps I and II shown in
For example, as shown in
However, there are conceivable cases where an ideal number of samples that should be applied to the flow cell cannot be obtained, such as when the number of subject-derived samples is small. In addition, in some cases, it could become necessary to perform analysis again on only some of the samples that have already been analyzed. If the number of samples to be subjected to one sequencing run varies, the data amount of sequence information obtained per sample will vary. This is because the total amount of nucleic acid contained in the libraries applied to a flow cell needs to be constant, so the amount of nucleic acid per sample within that total varies with the number of samples.
For example, when sequencing is performed using a number of samples (for example, 16 samples) that is ⅓ of an ideal number of samples (for example, 48 samples), the amount of nucleic acid per sample is three times the amount obtained when sequencing is performed using the ideal number of samples. As a result, the data amount of sequence information obtained per sample is likely to be three times the data amount obtained when sequencing is performed using the ideal number of samples.
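The arithmetic of this example can be written out as a short sketch. It assumes, as an idealization, that the total output of a run is fixed and is shared equally among the samples; the function name `data_per_sample` and the unit-free total of 48.0 are hypothetical.

```python
def data_per_sample(total_data, n_samples):
    """Data amount per sample when a fixed run output is shared equally."""
    if n_samples < 1:
        raise ValueError("at least one sample is required")
    return total_data / n_samples


TOTAL = 48.0  # hypothetical total output of one run, arbitrary units

ideal = data_per_sample(TOTAL, 48)    # ideal number of samples
reduced = data_per_sample(TOTAL, 16)  # one third of the ideal number
ratio = reduced / ideal               # 3.0: three times the data per sample
```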
In order to keep the quality of the analysis result of gene base sequences constant, it is desirable that the data amount of sequence information obtained per sample does not vary for each sequencing run. However, if the data amount of sequence information obtained per sample varies due to variation of the number of samples subjected to a sequencing run, it becomes necessary, in accordance with the result, to modify the existing analysis program used in analysis of gene base sequences, for example.
In order to utilize NGS in the medical field to help determination of diagnoses and therapies for diseases of subjects, it is important to output analysis results of a quality that is always constant. Thus, it is desirable that, even when the number of samples to be subjected to one sequencing run varies, the data amount of sequence information obtained per sample is kept constant, and the existing analysis program is used as it is.
The scope of the present invention is defined solely by the appended claims, and is not affected to any degree by the statements within this summary.
In order to solve the above problem, an analysis method according to one aspect of the present invention includes obtaining sequence information of nucleic acid contained in a measurement sample prepared by mixing at least one sample that contains subject-derived nucleic acid and a sample that contains non-subject-derived nucleic acid such that the measurement sample contains a previously determined amount of nucleic acid (S1); and outputting sequence information in which a data amount of sequence information per sample that contains subject-derived nucleic acid is a predetermined amount irrespective of the number of samples that contain subject-derived nucleic acid and that have been used in preparing the measurement sample (S2).
Here, the “subject” means, for example, a patient or the like who has a gene test such as a panel test. The “measurement sample” means a sample to be prepared so as to be subjected to sequencing. The “previously determined amount of nucleic acid” means the amount of nucleic acid determined on the basis of a protocol recommended for a sequencer 2 to be used and reagents to be used. That is, the “previously determined amount of nucleic acid” is an amount of nucleic acid realized when the number of samples is not smaller than the recommended number of samples to be subjected to one sequencing run. The “predetermined amount” means the data amount of sequence information per sample obtained when the measurement sample is prepared by use of a recommended number of samples.
According to the above configuration, the measurement sample is prepared so as to contain a previously determined amount of nucleic acid by mixing at least one sample that contains subject-derived nucleic acid and a sample that contains non-subject-derived nucleic acid. Then, sequence information of the nucleic acid of the measurement sample is obtained, and sequence information in which the data amount of sequence information per sample is the predetermined amount is outputted.
In order to keep the reliability of the analysis result of the sequence information constant, the quality of the sequence information needs to be appropriately evaluated. If the above analysis method is employed, even when the number of samples that contain subject-derived nucleic acid for preparing a measurement sample varies in gene tests, the variation in the data amount of sequence information per sample can be kept in a predetermined range, and an analysis result of constant quality can be outputted. Thus, even when the number of samples that contain subject-derived nucleic acid for preparing a measurement sample to be subjected to one sequencing run is smaller than the recommended number of samples, the variation in the data amount of sequence information per sample can be kept in a predetermined range.
In the preparing of the measurement sample (S304b), the amount of nucleic acid derived from each sample that contains subject-derived nucleic acid in the measurement sample may be substantially identical.
In the outputting of the sequence information (S2), the data amount of the sequence information per sample may account for a predetermined proportion in a data amount of the obtained sequence information of the nucleic acid of the measurement sample, irrespective of the number of samples that contain subject-derived nucleic acid and that have been used in the preparing of the measurement sample.
When the number of samples that contain subject-derived nucleic acid and that have been used in the preparing of the measurement sample has been changed, variation in the data amount of the sequence information per sample may be in a range of ±10%.
According to the above configuration, the quality of the analysis result of gene base sequences can be kept in a range allowable for a test result of a gene test such as a panel test.
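A check of the ±10% range could look like the following sketch. The function name `within_tolerance` and the choice of the data amount at the recommended number of samples as the reference value are assumptions made for illustration.

```python
def within_tolerance(observed, expected, tolerance=0.10):
    """Return True if observed deviates from expected by at most the tolerance.

    Assumption: 'expected' is the per-sample data amount obtained with the
    recommended number of samples, and the tolerance is relative to it.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    return abs(observed - expected) <= tolerance * expected


ok = within_tolerance(1.08, 1.0)       # 8% above the reference
too_high = within_tolerance(3.0, 1.0)  # three times the reference
```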
A data amount of sequence information of the non-subject-derived nucleic acid in the sequence information obtained in the obtaining of the sequence information (S1) may be greater than or equal to the data amount of the sequence information per sample.
Even when the data amount of the sequence information of the non-subject-derived nucleic acid is increased, the quality of the sequence information of the subject-derived nucleic acid is not influenced.
In the above configuration, the predetermined proportion is not dependent on the number of samples that contain subject-derived nucleic acid and that have been used in the preparing of the measurement sample (S304b).
A first measurement sample may be prepared by mixing samples that contain nucleic acid derived from a first subject group and a sample that contains non-subject-derived nucleic acid, and a second measurement sample may be prepared by mixing samples that contain nucleic acid derived from a second subject group and a sample that contains non-subject-derived nucleic acid. The number of subjects of the first subject group and the number of subjects of the second subject group may be different from each other.
Even when the number of samples that contain nucleic acid derived from a subject group to be used in preparing a measurement sample varies for each measurement sample, the variation of the data amount of sequence information per sample is kept in a predetermined range. Thus, the quality of the sequence information of subject-derived nucleic acid is not influenced.
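The effect described above can be illustrated with a small model. It assumes, as an idealization, that the reads of a run are distributed in proportion to the amount of nucleic acid each component contributes; the names `build_mix` and `expected_reads`, the recommended number of 48 samples, and the total of 4,800,000 reads are hypothetical values chosen for the illustration.

```python
def build_mix(n_subject_samples, recommended_n=48, per_sample_amount=1.0):
    """Compose a measurement sample topped up with non-subject nucleic acid."""
    if not 1 <= n_subject_samples <= recommended_n:
        raise ValueError("sample count must be between 1 and recommended_n")
    mix = {f"subject_{i}": per_sample_amount for i in range(n_subject_samples)}
    filler = (recommended_n - n_subject_samples) * per_sample_amount
    if filler > 0:
        mix["non_subject"] = filler
    return mix


def expected_reads(total_reads, amounts):
    """Split total reads in proportion to each component's amount (idealized)."""
    total_amount = sum(amounts.values())
    return {name: total_reads * amount / total_amount
            for name, amount in amounts.items()}


first = expected_reads(4_800_000, build_mix(16))   # first subject group
second = expected_reads(4_800_000, build_mix(40))  # second subject group
```

Under these assumptions each subject-derived sample receives the same expected number of reads in both measurement samples, and only the share taken by the non-subject-derived nucleic acid differs.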
An amount of the non-subject-derived nucleic acid in the measurement sample may be changed in accordance with the number of samples that contain subject-derived nucleic acid and that have been used in the preparing of the measurement sample (S304b).
The amount of the nucleic acid may be the number of moles of the nucleic acid. The number of moles of nucleic acid can be calculated on the basis of measurement values such as absorbance at 260 nm, an average molecular weight, a molar absorption coefficient of nucleic acid, and the like.
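One common way to estimate the number of moles from such measurement values is sketched below. It is not taken from the present description: it assumes double-stranded DNA, a 1 cm path length, the widely used approximations of 50 ng/µL per unit of absorbance at 260 nm and 660 g/mol per base pair, and a known mean fragment length. The function name `dsdna_picomoles` is hypothetical.

```python
def dsdna_picomoles(a260, volume_ul, mean_length_bp,
                    ng_per_ul_per_a260=50.0, g_per_mol_per_bp=660.0):
    """Estimate picomoles of double-stranded DNA from absorbance at 260 nm.

    Assumptions: 1 cm path length, 50 ng/uL per A260 unit, and an average
    molecular weight of 660 g/mol per base pair.
    """
    mass_ng = a260 * ng_per_ul_per_a260 * volume_ul
    molar_mass = mean_length_bp * g_per_mol_per_bp  # g/mol
    nanomoles = mass_ng / molar_mass                # ng / (g/mol) = nmol
    return nanomoles * 1000.0                       # nmol -> pmol


# 10 uL of a library with A260 = 0.2 and a mean fragment length of 300 bp
amount_pmol = dsdna_picomoles(a260=0.2, volume_ul=10.0, mean_length_bp=300)
```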
In the preparing of the measurement sample (S304b), an amount of nucleic acid contained in each measurement sample may be the previously determined amount of nucleic acid.
Variation in an amount of nucleic acid per sample included in the measurement sample may be in a range of ±10%.
According to the above configuration, the quality of the analysis result of gene base sequences can be kept in a range allowable for a test result of a gene test such as a panel test.
An amount of the non-subject-derived nucleic acid contained in the measurement sample may be greater than or equal to an amount of nucleic acid per sample contained in the measurement sample.
Accordingly, even when the number of samples that contain subject-derived nucleic acid is not sufficient in the preparing of a measurement sample, the insufficient amount can be compensated for by the non-subject-derived nucleic acid.
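The compensation can be sketched as a simple calculation. The sketch assumes that each subject-derived sample contributes the share it would have at the recommended number of samples and that the remainder of the previously determined amount is made up by the non-subject-derived nucleic acid; the function name `filler_amount` and the numerical values are hypothetical.

```python
def filler_amount(total_amount, recommended_n, actual_n):
    """Amount of non-subject nucleic acid needed to reach the total amount.

    Assumption: every subject-derived sample contributes
    total_amount / recommended_n, whatever the actual sample count is.
    """
    if not 1 <= actual_n <= recommended_n:
        raise ValueError("actual_n must be between 1 and recommended_n")
    per_sample = total_amount / recommended_n
    filler = total_amount - per_sample * actual_n
    return {"per_sample": per_sample, "non_subject": filler}


# Hypothetical run: previously determined amount 48.0 (arbitrary units),
# 48 samples recommended, only 16 subject-derived samples available.
mix = filler_amount(48.0, 48, 16)
```

In this example the amount of non-subject-derived nucleic acid (32.0) is greater than the amount per sample (1.0).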
In order to solve the above problem, an information processing apparatus (1) according to another aspect of the present invention includes a controller (11) and a storage unit (12). The controller (11) is programmed to obtain sequence information of nucleic acid contained in a measurement sample prepared by mixing at least one sample that contains subject-derived nucleic acid and a sample that contains non-subject-derived nucleic acid such that the measurement sample contains a previously determined amount of nucleic acid, and store the sequence information into the storage unit (12); and output sequence information in which a data amount of sequence information per sample that contains subject-derived nucleic acid is a predetermined amount irrespective of the number of samples that contain subject-derived nucleic acid and that have been used in preparing the measurement sample.
According to the above configuration, the information processing apparatus (1) analyzes sequence information in which the data amount of sequence information per sample is the predetermined amount irrespective of the number of samples that contain subject-derived nucleic acid and that have been used in preparing the measurement sample.
Thus, for example, even when the number of samples that contain subject-derived nucleic acid for preparing a measurement sample varies in gene tests, the variation in the data amount of sequence information per sample can be kept in a predetermined range, and an analysis result of constant quality can be outputted.
In order to solve the above problem, a gene analysis system (100) according to another aspect of the present invention includes a sequencer (2) configured to read sequence information of nucleic acid of a measurement sample prepared by mixing at least one sample that contains subject-derived nucleic acid and a sample that contains non-subject-derived nucleic acid such that the measurement sample contains a previously determined amount of nucleic acid; and an information processing apparatus (1) configured to obtain the sequence information and output a result of analyzing the sequence information. Irrespective of the number of samples that contain subject-derived nucleic acid and that have been used in preparing the measurement sample, a data amount of sequence information per sample in the sequence information is a predetermined amount.
According to the above configuration, the sequencer (2) performs sequencing on a measurement sample prepared by mixing at least one sample that contains subject-derived nucleic acid and a sample that contains non-subject-derived nucleic acid such that the measurement sample contains a previously determined amount of nucleic acid. Then, the information processing apparatus (1) analyzes sequence information in which the data amount of sequence information per sample is a predetermined amount irrespective of the number of samples that contain subject-derived nucleic acid and that have been used in preparing the measurement sample.
Thus, for example, even when the number of samples that contain subject-derived nucleic acid for preparing a measurement sample varies in gene tests, the variation in the data amount of sequence information per sample can be kept in a predetermined range, and an analysis result of constant quality can be outputted.
In order to solve the above problem, a program according to another aspect of the present invention causes a computer to perform obtaining sequence information of nucleic acid in a measurement sample prepared by mixing at least one sample that contains subject-derived nucleic acid and a sample that contains non-subject-derived nucleic acid such that the measurement sample contains a previously determined amount of nucleic acid, wherein, in the sequence information, a data amount of sequence information per sample accounts for a predetermined proportion in a data amount of sequence information of nucleic acid of the measurement sample (S1); analyzing the sequence information (S109); and outputting an analysis result (S111). In the obtaining of the sequence information of the nucleic acid (S1), a data amount of sequence information per sample that contains subject-derived nucleic acid is a predetermined amount irrespective of the number of samples that contain subject-derived nucleic acid and that have been used in preparing the measurement sample.
According to this configuration, for example, even when the number of samples that contain subject-derived nucleic acid for preparing a measurement sample varies in gene tests, the variation in the data amount of sequence information per sample can be kept in a predetermined range, and an analysis result of constant quality can be outputted.
A computer-readable storage medium having the above program stored therein is also included in the scope of the present invention.
One aspect of the present invention can also be described as follows.
An analysis method according to one aspect of the present invention includes obtaining sequence information of nucleic acid contained in a measurement sample prepared by mixing at least one sample that contains subject-derived nucleic acid and a sample that contains non-subject-derived nucleic acid such that the measurement sample contains a previously determined amount of nucleic acid (S1); and performing analysis on the obtained sequence information (S3). The performing of the analysis includes performing analysis on sequence information of the subject-derived nucleic acid in the obtained sequence information (S52), and not performing, on sequence information of the non-subject-derived nucleic acid in the obtained sequence information, at least a part of the analysis to be performed on the sequence information of the subject-derived nucleic acid (S53).
Here, the “subject” means, for example, a patient or the like who has a gene test such as a panel test. The “measurement sample” means a sample to be prepared so as to be subjected to sequencing. The “previously determined amount of nucleic acid” means the amount of nucleic acid determined on the basis of a protocol recommended for a sequencer 2 to be used and reagents to be used. That is, the “previously determined amount of nucleic acid” is an amount of nucleic acid realized when the number of samples is not smaller than the recommended number of samples to be subjected to one sequencing run. The “number of samples” means the number of samples whose sequence information is individually obtained. For example, in a case where one sample that contains nucleic acid extracted from a tissue and one sample that contains nucleic acid extracted from blood are prepared for one subject, the number of samples per subject is 2.
According to the above configuration, the measurement sample is prepared by mixing at least one sample that contains subject-derived nucleic acid, and a sample that contains non-subject-derived nucleic acid such that the measurement sample contains a previously determined amount of nucleic acid. Then, analysis is performed on sequence information of the subject-derived nucleic acid in the obtained sequence information, and at least a part of the analysis to be performed on the sequence information of the subject-derived nucleic acid is not performed on sequence information of the non-subject-derived nucleic acid.
If the above analysis method is employed, even when the number of samples that contain subject-derived nucleic acid for preparing a measurement sample varies in gene tests, analysis of constant quality can be efficiently performed on the sequence information of the subject-derived nucleic acid.
The analysis method may be configured such that the sequence information of the subject-derived nucleic acid includes an index sequence, and analysis is performed on sequence information that includes the index sequence, in the sequence information of the nucleic acid of the measurement sample.
The analysis method may be configured such that the sequence information of the subject-derived nucleic acid includes a plurality of pieces of sequence information of nucleic acid derived from a plurality of subjects, and sequence information of nucleic acid derived from a different subject includes a different index sequence.
The analysis method may be configured such that the sequence information of the non-subject-derived nucleic acid does not include an index sequence.
The analysis method may be configured such that analysis is performed on sequence information that includes an index sequence in the sequence information of the nucleic acid of the measurement sample (S52), and at least a part of the analysis to be performed on the sequence information that includes the index sequence is not performed on sequence information that does not include the index sequence (S53).
The analysis method may be configured such that the sequence information of the nucleic acid of the measurement sample includes sequence information that includes a first index sequence and sequence information that includes a second index sequence different from the first index sequence, and analysis is performed on the sequence information that includes the first index sequence, and at least a part of the analysis to be performed on the sequence information that includes the first index sequence is not performed on the sequence information that includes the second index sequence.
In the above configuration, the analysis may include obtaining information related to a gene of the subject, on the basis of the sequence information of the subject-derived nucleic acid. In the above configuration, information related to a gene of the subject may include a gene name corresponding to the sequence information and mutation information of the gene.
The preparing of the measurement sample (S304b) may include preparing a measurement sample having further added thereto a quality control sample for evaluating quality of sequence information, and the analysis method may further include performing a process for obtaining information related to quality of the measurement sample from sequence information of the quality control sample (S110).
The analysis method may be configured such that nucleic acid of the quality control sample is identical to the non-subject-derived nucleic acid, and may further include performing a process for obtaining information related to quality, on at least a part of the sequence information of the non-subject-derived nucleic acid in the sequence information of the nucleic acid of the measurement sample.
Irrespective of the number of samples that contain subject-derived nucleic acid and that have been used in the preparing of the measurement sample (S304b), an amount of nucleic acid derived from each sample in the measurement sample may be substantially identical.
A data amount of sequence information per sample in the sequence information, of the nucleic acid of the measurement sample, obtained in the obtaining of the sequence information (S1) may be substantially identical.
Accordingly, the quality of the analysis of sequence information of subject-derived nucleic acid can be kept at a constant level.
A data amount of sequence information per sample in the sequence information, of the nucleic acid of the measurement sample, obtained in the obtaining of the sequence information (S1) may account for a predetermined proportion in a data amount of the sequence information of the nucleic acid of the measurement sample, irrespective of the number of samples that contain subject-derived nucleic acid and that have been used in the preparing of the measurement sample.
Here, the predetermined proportion is, for example, a value determined in accordance with the number of samples containing subject-derived nucleic acid that is recommended for preparing a measurement sample.
When the number of samples that contain subject-derived nucleic acid and that have been used in the preparing of the measurement sample has been changed, variation in a data amount of sequence information per sample in the sequence information of the nucleic acid of the measurement sample may be in a range of ±10%.
According to the above configuration, the quality of the analysis result of gene base sequences can be kept in a range allowable for a test result of a gene test such as a panel test.
A data amount of the sequence information of the non-subject-derived nucleic acid in the sequence information obtained in the obtaining of the sequence information (S1) may be greater than or equal to a data amount of sequence information per sample in the sequence information, of the nucleic acid of the measurement sample, obtained in the obtaining of the sequence information (S1).
Even when the data amount of the sequence information of the non-subject-derived nucleic acid is increased, the quality of the sequence information of the subject-derived nucleic acid is not influenced.
The amount of the nucleic acid may be the number of moles of the nucleic acid. The number of moles of nucleic acid can be calculated on the basis of measurement values such as absorbance at 260 nm, an average molecular weight, a molar absorption coefficient of nucleic acid, and the like.
The obtaining of the sequence information of the nucleic acid contained in the measurement sample (S1) may include obtaining sequence information of nucleic acid of the measurement sample, captured by capture molecules for capturing the nucleic acid. Each capture molecule may include a base sequence complementary to at least a part of the nucleic acid contained in the measurement sample.
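Complementarity between a capture molecule and a target can be illustrated with a short sketch. It is a simplified model that treats sequences as strings over A, C, G and T and requires an exact match to the reverse complement; actual hybridization tolerates mismatches and depends on reaction conditions. The names `reverse_complement` and `can_capture` are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def can_capture(probe, target):
    """True if the probe is exactly complementary to some part of the target."""
    return reverse_complement(probe) in target.upper()


captured = can_capture("GCAT", "TTATGCAA")  # probe pairs with "ATGC"
missed = can_capture("AAAA", "CCCC")
```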
The sequence information may be a base sequence of the nucleic acid read by a sequencer.
In order to solve the above problem, an information processing apparatus (1) according to another aspect of the present invention includes a controller (11) and a storage unit (12). The controller (11) is configured to obtain sequence information of nucleic acid of a measurement sample prepared by mixing at least one sample that contains subject-derived nucleic acid and a sample that contains non-subject-derived nucleic acid such that the measurement sample contains a previously determined amount of nucleic acid, and store the sequence information into the storage unit (12), and perform analysis on the obtained sequence information. The controller performs analysis on sequence information of the subject-derived nucleic acid in the obtained sequence information, and does not perform, on sequence information of the non-subject-derived nucleic acid in the obtained sequence information, at least a part of the analysis to be performed on the sequence information of the subject-derived nucleic acid.
According to the above configuration, the information processing apparatus (1) obtains sequence information, performs analysis on sequence information of subject-derived nucleic acid, and does not perform, on sequence information of non-subject-derived nucleic acid, at least a part of the analysis to be performed on the sequence information of the subject-derived nucleic acid. Accordingly, for example, even when the number of samples that contain subject-derived nucleic acid for preparing a measurement sample varies in gene tests, analysis of constant quality can be efficiently performed.
The analysis to be performed on the obtained sequence information may include an alignment process (S12) for mapping the obtained sequence information on a reference sequence, and may be configured such that the alignment process is not performed on the sequence information of the non-subject-derived nucleic acid.
For example, in a case where “PhiX DNA” (Illumina, Inc.) which is nucleic acid derived from bacteriophage is used as the sample that contains non-subject-derived nucleic acid, there is no need to perform the alignment process. According to the above configuration, unnecessary processes can be appropriately omitted.
The analysis to be performed on the obtained sequence information may include a mutation extraction process (S14) for extracting a mutation of nucleic acid, and may be configured such that the mutation extraction process is not performed on the sequence information of the non-subject-derived nucleic acid.
For example, when a quality control sample for evaluating the quality of sequence information is used as the sample that contains non-subject-derived nucleic acid, there is no need to perform the mutation extraction process. According to the above configuration, unnecessary processes can be appropriately omitted.
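The omission of processes for non-subject-derived nucleic acid can be sketched as a selection of steps per sample. The step names, the `origin` labels, and the function name `steps_for` are hypothetical and only stand in for the alignment process (S12) and the mutation extraction process (S14).

```python
FULL_ANALYSIS = ("alignment", "mutation_extraction")


def steps_for(origin, skipped_for_non_subject=frozenset(FULL_ANALYSIS)):
    """Return the analysis steps to run for one sample.

    Subject-derived samples get the full analysis; for other samples the
    steps listed in skipped_for_non_subject are omitted.
    """
    if origin == "subject":
        return list(FULL_ANALYSIS)
    return [step for step in FULL_ANALYSIS
            if step not in skipped_for_non_subject]


subject_steps = steps_for("subject")
filler_steps = steps_for("non_subject")
partial_steps = steps_for("non_subject", frozenset({"mutation_extraction"}))
```

The third call shows the case where only a part of the analysis is omitted.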
In order to solve the above problem, a gene analysis system (100) according to another aspect of the present invention includes a sequencer (2) configured to read sequence information of nucleic acid of a measurement sample prepared by mixing at least one sample that contains subject-derived nucleic acid and a sample that contains non-subject-derived nucleic acid such that the measurement sample contains a previously determined amount of nucleic acid; and an information processing apparatus (1) configured to perform analysis on the sequence information that has been obtained. The information processing apparatus performs analysis on sequence information of the subject-derived nucleic acid in the obtained sequence information, and does not perform, on sequence information of the non-subject-derived nucleic acid in the obtained sequence information, at least a part of the analysis to be performed on the sequence information of the subject-derived nucleic acid.
According to the above configuration, the sequencer (2) performs sequencing on a measurement sample prepared by mixing at least one sample that contains subject-derived nucleic acid and a sample that contains non-subject-derived nucleic acid such that the measurement sample contains a previously determined amount of nucleic acid. Then, the information processing apparatus (1) obtains sequence information, performs analysis on sequence information of the subject-derived nucleic acid, and does not perform, on sequence information of the non-subject-derived nucleic acid in the obtained sequence information, at least a part of the analysis to be performed on the sequence information of the subject-derived nucleic acid.
Accordingly, for example, even when the number of samples that contain subject-derived nucleic acid for preparing a measurement sample varies in gene tests, analysis of constant quality can be efficiently performed.
In order to solve the above problem, a program according to another aspect of the present invention is configured to cause a computer to perform obtaining sequence information of nucleic acid of a measurement sample prepared by mixing at least one sample that contains subject-derived nucleic acid and a sample that contains non-subject-derived nucleic acid such that the measurement sample contains a previously determined amount of nucleic acid (S1); and performing analysis on the obtained sequence information (S52). In the performing of the analysis, the computer performs analysis on sequence information of the subject-derived nucleic acid in the obtained sequence information (S52), and does not perform, on sequence information of the non-subject-derived nucleic acid in the obtained sequence information, at least a part of the analysis to be performed on the sequence information of the subject-derived nucleic acid (S53).
According to this configuration, for example, even when the number of samples that contain subject-derived nucleic acid for preparing a measurement sample varies in gene tests, analysis of constant quality can be efficiently performed.
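The selective analysis described above can be illustrated with a minimal sketch. The sketch assumes that each piece of sequence information already carries a flag indicating whether it is subject-derived (for example, as determined from an index sequence). The names `Read`, `quality_check`, `extract_mutations`, and `analyze`, as well as the toy quality rule and the position-by-position comparison, are hypothetical and are not the program of the present disclosure.

```python
from dataclasses import dataclass


@dataclass
class Read:
    sequence: str
    subject_derived: bool  # assumed flag, e.g. set from the index sequence


def quality_check(read):
    # Toy quality rule (an assumption): non-empty and no ambiguous base "N".
    return len(read.sequence) > 0 and "N" not in read.sequence


def extract_mutations(read, reference):
    # Toy mutation extraction: positions where the read differs from the reference.
    return [
        (i, ref_base, read_base)
        for i, (ref_base, read_base) in enumerate(zip(reference, read.sequence))
        if ref_base != read_base
    ]


def analyze(reads, reference):
    results = {"qc_passed": 0, "mutations": []}
    for read in reads:
        # The quality step is applied to every read, whatever its origin.
        if not quality_check(read):
            continue
        results["qc_passed"] += 1
        # Mutation extraction is performed only on subject-derived reads;
        # non-subject-derived reads skip this part of the analysis.
        if read.subject_derived:
            results["mutations"].extend(extract_mutations(read, reference))
    return results
```

In this sketch, the non-subject-derived reads still contribute to the quality step but never reach `extract_mutations`, which corresponds to omitting at least a part of the analysis for the non-subject-derived nucleic acid.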
A computer-readable non-transitory storage medium having the above program stored therein is also included in the scope of the present invention.
An analysis method according to one embodiment of the present disclosure is a method for outputting an analysis result of constant quality in gene tests. When this analysis method is applied, even if the number of samples subjected to one sequencing run is smaller than a recommended number of samples, the data amount of sequence information per sample is prevented from significantly varying and exceeding a predetermined range, so that an analysis result of constant quality can be outputted.
First, how a gene test is performed is described with reference to
In a case where it would be advantageous for a subject to have a gene test in order to determine a diagnosis and a therapeutic strategy for the subject, the attending physician of the subject explains this to the subject, and obtains the consent of the subject to the gene test (step S91). When the subject consents, a tissue and blood of the subject, which are each used as a sample in the gene test, are collected (step S92). Each collected sample is stored in a predetermined container.
Next, pretreatment of genes extracted from the sample, and sequencing are performed (step S93). Then, sequence information obtained as a result of the sequencing is analyzed, any abnormality in each gene to be analyzed is detected (step S94), and a report that includes a quality evaluation index indicating the quality of the gene test and information related to the detected abnormality is created (step S95).
Then, the significance of the information included in the report is determined by an expert panel composed of a plurality of specialists of gene tests (step S96). The attending physician of the subject explains to the subject the result of the gene test on the basis of the report, and selects a therapeutic strategy after having a discussion with the subject (step S97).
The outline of the analysis method according to one embodiment of the present disclosure is described with reference to
Step S1 is a step of obtaining sequence information of nucleic acid contained in a measurement sample prepared so as to contain a previously determined amount of nucleic acid. The measurement sample can be prepared by use of a sample that contains subject-derived nucleic acid. The sample containing subject-derived nucleic acid is obtained by extracting nucleic acid such as DNA and RNA from blood, a tissue, and the like collected from a subject (for example, a patient) by use of a known method, for example. Sequencing includes processes of reading the base sequence of fragments (DNA fragments in a case where DNA is to be analyzed) of one or a plurality of genes to be analyzed which have been collected in the pretreatment, and generating sequence information.
The measurement sample means a sample prepared so as to be subjected to sequencing performed by a sequencer 2. When the amount of nucleic acid of the subject-derived samples is less than a previously determined amount of nucleic acid, the measurement sample is prepared so as to contain the previously determined amount of nucleic acid, by mixing a non-subject-derived sample thereto.
Here, the “non-subject-derived nucleic acid” means nucleic acid or the like derived from a virus, a microorganism, a plant, or an insect, for example. As the “non-subject-derived nucleic acid”, “PhiX DNA” or the like that is provided by Illumina, Inc. can be suitably used, for example. PhiX DNA is bacteriophage-derived nucleic acid. PhiX DNA has a small molecular weight and is highly diverse in sequence.
The “previously determined amount of nucleic acid” means the amount of nucleic acid determined on the basis of a protocol recommended for the sequencer 2 to be used and reagents to be used. That is, the “previously determined amount of nucleic acid” is the amount of nucleic acid realized when the number of samples is not smaller than the recommended number of samples to be subjected to one sequencing run. The “previously determined amount of nucleic acid” is to ensure that the quality of the analysis result of base sequences obtained as a result of sequencing is at a certain level or higher. The “previously determined amount of nucleic acid” may be an amount specified between an upper limit amount and a lower limit amount.
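The preparation of the measurement sample so as to contain the previously determined amount of nucleic acid can be sketched as simple arithmetic. The sketch below assumes that all amounts are expressed in one consistent unit and that a single target amount within the upper and lower limits is chosen. The function names `non_subject_amount_to_add` and `within_limits` are hypothetical.

```python
def non_subject_amount_to_add(subject_amounts, target_amount):
    """Amount of non-subject-derived nucleic acid to mix in.

    subject_amounts: amounts of nucleic acid of the subject-derived samples.
    target_amount:   the previously determined amount (same unit, assumed).
    """
    total_subject = sum(subject_amounts)
    # Nothing is added when the subject-derived samples already reach the target.
    return max(0.0, target_amount - total_subject)


def within_limits(total_amount, lower_limit, upper_limit):
    # The previously determined amount may be specified as a range.
    return lower_limit <= total_amount <= upper_limit
```

For example, with three subject-derived samples of 1.0 unit each and a target of 8.0 units, `non_subject_amount_to_add([1.0, 1.0, 1.0], 8.0)` yields 5.0 units of non-subject-derived nucleic acid.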
The sequence information is sequence information of nucleic acid captured by capture molecules provided on the surface of the flow path in a flow cell, for example. Except for an operation of applying the measurement sample to a predetermined flow cell, and an operation of setting the flow cell on the sequencer 2, obtainment of sequence information is performed by the sequencer 2. In some cases, capture molecules for capturing nucleic acid are immobilized on the surface of a solid phase in a predetermined flow cell or the like recommended to be used in the sequencer 2. The capture molecules include base sequences that are complementary to at least a part of nucleic acid contained in the measurement sample.
Some of the operations included in step S1 above (for example, the operation of applying the measurement sample to the flow cell, and the operation of setting the flow cell on the sequencer 2) are performed by an operator of the sequencer 2 or a person in charge of the test. However, some of the operations in step S1 may be performed by one or a plurality of working robots as described below.
Step S2 is a step of outputting sequence information in which the data amount of sequence information per sample is a predetermined amount, irrespective of the number of samples containing subject-derived nucleic acid that have been used in preparing the measurement sample. Here, the “predetermined amount” may be an amount specified between an upper limit amount and a lower limit amount. The predetermined data amount is a data amount that accounts for a predetermined proportion in the data amount of sequence information obtained in step S1. This step is a part of sequencing and is performed by the sequencer 2.
For example, if the recommended number of samples containing subject-derived nucleic acid for preparing a measurement sample is 3, the data amount of sequence information per sample accounts for about ⅓ (i.e., about 33%) of the data amount of the obtained sequence information. If the recommended number is 8, the data amount of sequence information per sample accounts for about ⅛ (i.e., about 12.5%) of the data amount of the obtained sequence information. Thus, the predetermined proportion is a value that changes according to the recommended number of samples containing subject-derived nucleic acid for preparing a measurement sample.
When the number of samples containing subject-derived nucleic acid that have been used in preparation of the measurement sample has been changed, the variation in the data amount of sequence information per sample is preferably in a range of ±10%. In this case, the data amount of sequence information of non-subject-derived nucleic acid in the obtained sequence information may be at least an amount that corresponds to the data amount of sequence information of nucleic acid per sample contained in the sequence information.
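The relationship between the recommended number of samples and the per-sample proportion can be checked with a short sketch. It assumes that the ±10% range is taken relative to the expected per-sample proportion, which is one possible reading, and that the remainder of the run is filled with non-subject-derived nucleic acid. The function names are hypothetical.

```python
from fractions import Fraction


def per_sample_proportion(recommended_samples):
    # Each sample is expected to account for 1/N of the obtained data amount.
    return Fraction(1, recommended_samples)


def non_subject_proportion(recommended_samples, actual_samples):
    # Share of the run left for non-subject-derived nucleic acid when fewer
    # subject-derived samples than recommended are used.
    if not 0 < actual_samples <= recommended_samples:
        raise ValueError("actual_samples must be between 1 and recommended_samples")
    return Fraction(recommended_samples - actual_samples, recommended_samples)


def within_tolerance(observed, expected, tolerance=Fraction(1, 10)):
    # Assumption: the +/-10% range is relative to the expected proportion.
    return abs(observed - expected) <= tolerance * expected
```

With a recommended number of 8 and 5 subject-derived samples actually used, `per_sample_proportion(8)` is 1/8 and `non_subject_proportion(8, 5)` is 3/8, so the non-subject-derived share is at least one sample's worth of data.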
Step S3 is a step of analyzing sequence information and outputting an analysis result. This step is performed by an information processing apparatus 1. The information processing apparatus 1 is a computer that performs analysis on sequence information, to be analyzed, which has been generated and outputted by the sequencer 2 on the basis of the base sequence data having been read. The base sequence data means polynucleotide sequence data obtained by sequencing, and is base sequence data outputted by the sequencer 2.
In order to keep the quality of the analysis result of gene base sequences constant, the quality of sequence information needs to be appropriately evaluated. However, if the data amount of sequence information obtained per sample varies, it also becomes necessary to vary, in accordance with the variation of the data amount, the indexes for evaluating the quality of sequence information.
For example, one of the indexes for evaluating the quality of sequence information is depth. The depth is a quality evaluation index based on the total number of pieces of sequence information obtained by reading each base contained in each gene to be analyzed. In general, a reference value for depth is finely set in advance for a case where sequencing is performed using an ideal number of samples, and the quality of sequence information is evaluated according to whether or not the depth is equal to or greater than the given reference value. In order to keep the quality of the analysis result of gene base sequences constant, the reference value for depth in an existing analysis program to be used in analysis of gene base sequences needs to be changed in accordance with variation of the number of samples to be subjected to one sequencing run.
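The depth-based evaluation can be illustrated with a minimal sketch. It assumes that each aligned read is given as a pair of a zero-based start position and a length within the region to be analyzed, and that the minimum per-base depth is the value compared with the reference value. Both are simplifying assumptions, and the function names are hypothetical.

```python
def per_base_depth(region_length, aligned_reads):
    """Count, for each base of the region, how many reads cover it.

    aligned_reads: iterable of (start, length) pairs, zero-based (assumed format).
    """
    depth = [0] * region_length
    for start, length in aligned_reads:
        # Clip each read to the region before counting.
        for pos in range(max(start, 0), min(start + length, region_length)):
            depth[pos] += 1
    return depth


def depth_is_acceptable(depth, reference_value):
    # Assumption: every base must reach the reference value.
    return min(depth) >= reference_value
```

If the data amount per sample falls, the same region is covered by fewer reads and `per_base_depth` returns lower counts, which is why a fixed reference value only remains meaningful when the data amount per sample is kept constant.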
If steps S1 to S3 shown in
In addition, if the data amount of sequence information obtained per sample varies, it becomes necessary to vary, in accordance with the variation, the criteria for detecting abnormalities in genes to be analyzed.
For example, gene abnormalities to be detected in a panel test include polymorphisms such as single nucleotide polymorphism (SNP) and copy number variation (CNV). In order to prevent occurrence of variation in the accuracy of detecting polymorphisms due to variation of the data amount of sequence information obtained per sample, it is necessary to set criteria for detecting polymorphisms in the existing analysis program to be used in analysis of gene base sequences, in accordance with the number of samples to be subjected to one sequencing run.
If steps S1 to S3 shown in
The information processing apparatus 1 may be configured to have the functions of the auxiliary apparatus 2a shown in
The flow of the process in which the information processing apparatus 1 having the functions of the auxiliary apparatus 2a performs analysis on sequence information of subject-derived nucleic acid is described with reference to
In order to allow determination whether or not the sequence information is sequence information of subject-derived nucleic acid, the sequence information of subject-derived nucleic acid in the sequence information obtained in step S1 of
In the following, one embodiment of the present disclosure is described in detail.
First, the outline of the gene analysis system 100 including the information processing apparatus 1 according to one embodiment of the present disclosure is described with reference to
The gene analysis system 100 shown in
The test institution 120 tests/analyzes the sample provided from the medical institution 210, creates a report based on the analysis result, and provides the report to the medical institution 210. In the example shown in
The analysis system management institution 130 manages general analyses that are performed in each test institution 120 that uses the gene analysis system 100. The analysis system management institution 130 may be the same institution as the test institution 120.
The medical institution 210 is an institution in which doctors, nurses, pharmacists, and the like perform medical activities such as providing diagnosis, therapy, and dispensation to patients, and examples of the medical institution 210 include hospitals, clinics, and pharmacies.
Although
Next, the flow of processes performed in an application example of the gene analysis system 100 shown in
First, a test institution 120 that is going to use the gene analysis system 100 introduces the information processing apparatus 1. Then, the test institution 120 files an application for use of the gene analysis system 100 to the analysis system management institution 130 (step S101). S101 can be omitted. For example, in a case where the analysis system management institution 130 is identical to the test institution 120, S101 is omitted.
The test institution 120 and the analysis system management institution 130 can conclude in advance a desired contract with regard to use of the gene analysis system 100, from among a plurality of contract types. For example, service contents provided from the analysis system management institution 130 to the test institution 120, a method of determination of a system usage fee charged to the test institution 120 by the analysis system management institution 130, a method of payment for a system usage fee, and the like may be selected from a plurality of different contract types. The management server 3 of the analysis system management institution 130 specifies the content of the contract concluded with the test institution 120, in response to the application filed from the test institution 120 (step S102). S102 can be omitted. For example, in a case where the analysis system management institution 130 is identical to the test institution 120, S102 is omitted.
Next, the management server 3, managed by the analysis system management institution 130, provides a test institution ID to the information processing apparatus 1 of the test institution 120 having concluded the contract, and starts providing various services (step S103). S103 can be omitted. For example, in a case where the analysis system management institution 130 is identical to the test institution 120, S103 is omitted. In a case where the analysis system management institution 130 is identical to the test institution 120, the test institution ID and various services are managed by the test institution 120 itself.
The information processing apparatus 1 receives information, programs, and the like for controlling the analysis process of gene base sequences, creation of a report based on the analysis result, and the like, from the management server 3. Accordingly, the test institution 120 becomes able to receive various services from the analysis system management institution 130. The information processing apparatus 1 can output an analysis result, a report, and the like based on the inputted information related to a gene panel (hereinafter, also referred to as gene panel information). In a case where the analysis system management institution 130 is identical to the test institution 120, the test institution 120 itself manages information, programs, and the like for controlling the analysis process of gene base sequences, creation of a report based on the analysis result, and the like.
In many cases, a gene panel includes a set of reagents such as a primer and a probe. The gene panel may be used for analyzing polymorphisms, such as mutation, single nucleotide polymorphism (SNP), and copy number variation (copy number abnormality) (CNV), that have occurred in genes. The gene panel may be used for outputting information regarding the amount of mutations in the entirety of genes to be analyzed (also referred to as Tumor Mutation Burden, or the like), and for calculation of the methylation frequency.
Herein, a “gene panel” means a gene panel that allows batch analysis of a plurality of abnormalities in a plurality of genes, and that allows a test of samples related to a plurality of diseases. Such a gene panel is also referred to as a “multi-panel” or a “large panel”, and is used for analyzing genes that are related to a plurality of diseases. In such a gene panel, base sequences read from exon regions each having a base length of 10 Mb (10 million bases) or greater are to be analyzed.
In the medical institution 210, a doctor or the like collects a sample such as blood or a tissue of a lesion site of a subject as necessary. When analysis of the collected sample is requested from the test institution 120, an analysis request is transmitted from a communication terminal 5 provided in the medical institution 210, for example (step S105). When requesting analysis of a sample from the test institution 120, the medical institution 210 transmits an analysis request and provides the test institution 120 with a sample ID provided for each sample. The sample ID provided for each sample associates the sample with, for example, information regarding the subject from whom the sample has been collected (for example, patient ID), and identification information for identifying the disease of the subject (for example, disease name and disease ID). A subject ID, a disease ID, and the like may be transmitted, together with the sample ID, from the medical institution 210 to the test institution 120. In the test institution 120, the sample ID and the subject ID are managed in association with the disease ID.
In the following, an example case in which the medical institution 210 requests a panel test analysis from the test institution 120 is described. The panel test is not limited to laboratory tests, but includes tests for research use.
Herein, a “subject” means a human subject. However, the concept of the present disclosure can be applied to a genome derived from an organism such as any animal other than a human, and is useful also in the fields such as medical care, veterinary medicine, and zoological science.
When a gene panel test is requested from the medical institution 210, a desired gene panel may be designated. Therefore, gene panel information can be included in the analysis request transmitted from the medical institution 210 in step S105 shown in
The information processing apparatus 1 receives the analysis request from the medical institution 210 (S106). Further, the information processing apparatus 1 receives a sample from the medical institution 210, which is the transmission source of the analysis request. In the medical institution 210 (and the test institution 120), the subject name, the subject ID, the disease name, the disease ID, the sample ID, and the like are recorded/managed in association with one another.
Each sample provided from the medical institution 210 is stored in a container as shown in
Alternatively, as shown in
There are a plurality of gene panels that can be used in analysis that the test institution 120 is requested to perform by the medical institution 210, and a gene group to be analyzed is fixed for each gene panel. The test institution 120 can selectively use a plurality of gene panels so as to suit the purpose of the analysis. That is, for a first sample provided from the medical institution 210, a first gene panel can be used in order to analyze a first gene group to be analyzed, and for a second sample, a second gene panel can be used in order to analyze a second gene group to be analyzed.
The information processing apparatus 1 receives, from an operator, an input of gene panel information of a gene panel to be used in order to analyze the sample (step S107).
In the test institution 120, the received sample is subjected to pretreatment using the gene panel, and sequencing is performed by use of the sequencer 2 (step S108).
In addition, in the test institution 120, separately from sequencing performed on subject-derived samples, a predetermined quality control sample corresponding to the gene panel is subjected to pretreatment using the gene panel, and sequencing is performed by use of the sequencer 2 (step S108), whereby accuracy control is performed.
The result obtained by subjecting the quality control sample to a gene test including pretreatment, sequencing, sequence analysis, and the like is used as a quality evaluation index of the panel test.
Each gene panel may be associated with one or a plurality of quality control samples. Alternatively, for example, for each gene panel, a corresponding quality control sample may be prepared in advance. Further, a quality control sample may be measured individually, or may be measured together with a sample provided from the medical institution 210.
The pretreatment is a series of processes for preparing a measurement sample. The pretreatment corresponds to steps S1 to S2 in
The sequencer 2 may output, to the information processing apparatus 1, sequence information including a quality score which is a quality evaluation index for the step of reading gene base sequences. The sequencer 2 may output, to the information processing apparatus 1, a cluster concentration which is a quality evaluation index for a step of amplifying DNA fragments to be analyzed. The “quality score” and the “cluster concentration” are described later.
The information processing apparatus 1 obtains sequence information from the sequencer 2 and analyzes gene base sequences (step S109).
The quality control sample is also processed in the same steps as those performed in the panel test on the sample provided from the medical institution 210. Thus, gene sequence information of the quality control sample is also analyzed in the same manner as that of the sample provided from the medical institution 210. On the basis of the result of analyzing the quality control sample, a quality evaluation index for evaluating the quality of the panel test is generated.
Next, the information processing apparatus 1 evaluates the quality of the panel test on the basis of the quality evaluation index generated by a quality-control unit 117 (step S110). Specifically, the information processing apparatus 1 can evaluate the quality of each panel test on the basis of a result of comparison between the generated quality evaluation index and an evaluation criterion set for each quality evaluation index stored in quality evaluation criteria 126 shown in
The quality control sample is a sample that contains non-subject-derived nucleic acid. The information processing apparatus 1 may perform a process for obtaining information related to the quality, on at least a part of sequence information of the sample that contains non-subject-derived nucleic acid, in the sequence information of nucleic acid of the measurement sample. In this case, at least a part of sequence information of the sample that contains non-subject-derived nucleic acid is used as a substitute for sequence information of the quality control sample.
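The comparison between generated quality evaluation indexes and the evaluation criterion set for each index can be sketched as follows. The sketch assumes that each criterion is a pair of a lower limit and an upper limit, either of which may be absent. The function name `evaluate_quality`, the dictionary layout, and the numeric thresholds in the usage note are hypothetical and are not values of the present disclosure.

```python
def evaluate_quality(indexes, criteria):
    """Compare each quality evaluation index with its criterion.

    indexes:  mapping of index name -> measured value.
    criteria: mapping of index name -> (lower_limit, upper_limit);
              None means that limit is not set (assumed representation).
    Returns a mapping of index name -> True if the criterion is met.
    """
    results = {}
    for name, (lower, upper) in criteria.items():
        value = indexes.get(name)
        results[name] = (
            value is not None  # a missing index is treated as not meeting the criterion
            and (lower is None or value >= lower)
            and (upper is None or value <= upper)
        )
    return results
```

As a usage example under the assumed thresholds, calling `evaluate_quality` with criteria `{"quality_score": (30, None), "cluster_concentration": (1000, 1400)}` and indexes `{"quality_score": 32, "cluster_concentration": 1500}` marks the quality score as meeting its criterion and the cluster concentration as not meeting it.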
The information processing apparatus 1 creates a report on the basis of the analysis result obtained in step S109, and the index generated on the basis of the result of analyzing the quality control sample (step S111), and transmits the created report to the communication terminal 5 (step S112). For example, the report may include data of an alignment result of the sequence information; data itself of the result of analysis by the information processing apparatus 1, such as data regarding identified gene mutations or the like; and information regarding the quality of the panel test.
The created report may be printed in the test institution 120. For example, the test institution 120 may send the created report in the form of a paper medium to the medical institution 210.
The information processing apparatus 1 of the test institution 120 that uses the gene analysis system 100 notifies the management server 3 of the gene panel information of the gene panel having been used in the analysis, information regarding the analyzed genes, an analysis record, the quality evaluation index generated for the gene test having been performed, and the like (step S114). S114 can be omitted. For example, in a case where the analysis system management institution 130 is identical to the test institution 120, S114 is omitted. In this case, the test institution 120 itself manages the analysis record, the quality evaluation index, and the like.
The management server 3 obtains a test institution ID, a gene panel ID, a gene ID, an analysis record, and the like, via, for example, a communication line 4 from the information processing apparatus 1 of each test institution 120 that uses the gene analysis system 100. The management server 3 stores the obtained test institution ID, gene panel ID, gene ID, analysis record, quality evaluation index, and the like so as to be associated with one another (step S115). S115 can be omitted. For example, in a case where the analysis system management institution 130 is identical to the test institution 120, S115 is omitted. In this case, the test institution 120 itself manages the analysis record, the quality evaluation index, and the like.
The test institution ID is information for specifying the test institution 120 that performs gene sequence analysis. The test institution ID may be an operator ID which is identification information provided to each operator who belongs to the test institution 120 that uses the information processing apparatus 1.
The gene panel ID is identification information provided for specifying a gene panel to be used in analysis of genes to be analyzed. The gene panel ID provided to the gene panel is associated with a gene panel name, the name of the company that provides the gene panel, and the like.
The gene ID is identification information provided to each gene for specifying a gene to be analyzed.
The analysis record is information regarding the analysis state of gene sequence information. For example, the analysis record may be the number of times sequence analysis using a predetermined gene panel has been performed in the information processing apparatus 1, may be the number of genes that have been analyzed, or may be an accumulated total of the number of gene mutations that have been identified. Alternatively, the analysis record may be information regarding the amount of data that has been processed in the analysis.
The management server 3 aggregates, for each test institution 120, the analysis records in a predetermined period (for example, any period such as a day, week, month, or year) and determines a system usage fee in accordance with the aggregation result and the contract type (step S116). The analysis system management institution 130 may charge the determined system usage fee to the test institution 120, and request payment of the system usage fee to the analysis system management institution 130. S116 can be omitted. For example, in a case where the analysis system management institution 130 is identical to the test institution 120, S116 is omitted.
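The aggregation of analysis records and the determination of a system usage fee can be sketched as follows. The sketch assumes that each analysis record is a pair of a test institution ID and the date on which a panel test was performed, and that the contract type charges a fixed fee per operation. The record layout, the function names, and any fee value are hypothetical.

```python
from collections import defaultdict
from datetime import date


def aggregate_operations(records, start, end):
    """Count operations per test institution within [start, end].

    records: iterable of (test_institution_id, performed_on) pairs (assumed layout).
    """
    counts = defaultdict(int)
    for institution_id, performed_on in records:
        if start <= performed_on <= end:
            counts[institution_id] += 1
    return dict(counts)


def usage_fee(operation_count, fee_per_operation):
    # Assumed contract type: the fee is proportional to the number of operations.
    return operation_count * fee_per_operation
```

Other contract types (for example, a fee based on the number of analyzed genes or identified mutations) would replace the counting rule in `aggregate_operations` while keeping the same per-institution, per-period structure.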
The gene analysis system 100 is a system for analyzing gene sequence information, and includes at least the information processing apparatus 1 and the management server 3. The information processing apparatus 1 is connected to the management server 3 via the communication line 4 such as an intranet or the Internet.
The sequencer 2 is a base sequence analyzing apparatus that is used in order to read the base sequences of genes contained in a sample.
The sequencer 2 according to the present embodiment is preferably a next generation sequencer that performs sequencing using a next generation sequencing technology, or a third-generation sequencer. The next generation sequencer denotes one of base sequence analyzing apparatuses which have been developed in recent years. The next generation sequencer has a significantly improved analytical capability realized by processing, in parallel in a flow cell, a large number of single DNA molecules or clonally amplified DNA templates.
Sequencing technology usable in the present embodiment can be a sequencing technology that obtains a plurality of reads by reading the same region multiple times (deep sequencing).
Examples of the sequencing technology usable in the present embodiment include sequencing technologies that can obtain a large number of reads per sequencing run, such as ionic semiconductor sequencing, pyrosequencing, sequencing-by-synthesis using a reversible dye terminator, sequencing-by-ligation, and sequencing that uses probe ligation of oligonucleotides. The present disclosure may be applied to whole genome sequencing which does not analyze the base sequences of a specific region but analyzes the base sequences of the entire genome. The whole genome sequencing can be applied to a gene panel to be used for analyzing genes related to a plurality of diseases. The whole genome sequencing can read base sequences from exon regions each having a base length of 10 Mb (10 million bases) or greater.
The sequence primer to be used in sequencing is not limited in particular, and is set as appropriate on the basis of a sequence that is suitable for amplifying a target region. Reagents to be used in sequencing may also be suitably selected in accordance with the sequencing technology and the sequencer 2 to be used. The procedure from the pretreatment to the sequencing is described later by using a specific example.
Next, data stored in the management server 3 is described with reference to
In data 3A, the name of a test institution that uses the gene analysis system 100, and a test institution ID provided to the test institution are associated with each other. In data 3B, the type of contract concluded between the analysis system management institution 130 and a test institution 120, services to be provided to the test institution that has concluded the contract (for example, usable gene panel), and a system usage fee are associated with one another.
For example, in a case where a test institution “Institution P” has concluded a contract of “Plan 1” with the analysis system management institution 130, the analysis system management institution 130 charges the test institution P a usage fee according to the number of times of operation. “The number of times of operation” is the number of times a panel test has been performed by the information processing apparatus 1, for example. When the test institution P starts using the gene analysis system 100, the test institution P logs in to the gene analysis system 100 by using the test institution ID and a password of the test institution P. On the basis of the test institution ID inputted at the time of login, the management server 3 can specify the test institution name, the contract type, and the like.
“Plan 3” is a higher-order plan of “Plan 1”. “Plan 3” is obtained by adding provision of auxiliary information usable for “CDx usage”, to “Plan 1”. Therefore, the cost for concluding a contract of “Plan 3” may be higher than the cost for concluding a contract of “Plan 1”.
CDx information necessary for creating a report that includes auxiliary information related to the efficacy of drugs applicable to companion diagnostics (CDx) is provided to the test institution that has concluded the contract of “Plan 3” (see S104 in
Data 3C to 3E are analysis records regarding the number of times of operation that was performed, genes that were analyzed, and the total number of gene mutations that were identified, by the test institution using the gene analysis system 100 in a period from Aug. 1, 2017 to Aug. 31, 2017. These analysis records are transmitted from the information processing apparatus 1 to the management server 3, and are stored in the management server 3. On the basis of the data of these analysis records, the analysis system management institution 130 determines a system usage fee to be charged to each test institution. The record aggregation period is not limited to that mentioned above. The records may be aggregated in any period such as a day, week, month, or year.
When the analysis system management institution 130 determines a system usage fee, the system usage fee may be changed depending on whether the gene panel that was used in the test was from a company that provides (for example, produces or sells) the gene panel. In this case, it is sufficient that data 3F is stored in the management server 3. In data 3F, the name of a company that provides gene panels, such as “Company A” or “Company B”, a gene panel ID, and an agreement regarding the system usage fee (for example, whether a system usage fee is required or not) are associated with one another.
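The fee determination described above can be illustrated with a minimal sketch. The record layouts, the unit price, the identifiers such as "GP-A", and the function name are hypothetical; the description does not specify concrete fee amounts or formulas. The sketch assumes a per-operation charge, as under “Plan 1”, and does not charge for operations whose gene panel ID is associated in data 3F with an agreement under which no system usage fee is required.

```python
from dataclasses import dataclass


@dataclass
class PanelAgreement:
    """One row of data 3F (hypothetical layout)."""
    company: str
    gene_panel_id: str
    fee_required: bool


@dataclass
class OperationRecord:
    """One panel test run reported by the information processing apparatus 1."""
    test_institution_id: str
    gene_panel_id: str


def usage_fee(records, agreements, unit_price):
    """Charge unit_price per operation, skipping panels whose agreement waives the fee."""
    waived = {a.gene_panel_id for a in agreements if not a.fee_required}
    billable = [r for r in records if r.gene_panel_id not in waived]
    return len(billable) * unit_price


agreements = [
    PanelAgreement("Company A", "GP-A", fee_required=False),
    PanelAgreement("Company B", "GP-B", fee_required=True),
]
records = [
    OperationRecord("P", "GP-A"),
    OperationRecord("P", "GP-B"),
    OperationRecord("P", "GP-B"),
]
fee = usage_fee(records, agreements, unit_price=100)  # two billable operations
```

Keeping the agreement table separate from the analysis records means that a change in an agreement with a gene panel provider only requires data 3F to be updated.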
An example in which “Institution P” concluded a contract of “Plan 1” with the analysis system management institution 130 and the analysis records are those shown in
The information processing apparatus 1 includes a controller 11 which obtains sequence information, to be analyzed, including base sequence data read by the sequencer 2 and gene panel information including a plurality of genes to be analyzed; and an output unit 13 which outputs a result of analysis, of the sequence information, based on the gene panel information obtained by the controller 11. The information processing apparatus 1 can be configured by use of a computer. For example, the controller 11 is implemented by a processor such as a CPU (central processing unit), and the storage unit 12 is implemented by a hard disk drive.
In the storage unit 12, a program for sequence analysis, a program for generating a single reference sequence, and the like are also stored. The output unit 13 includes a display, a printer, a speaker, and the like. An input unit 17 includes a keyboard, a mouse, a touch sensor, and the like. A device may be used that has both of the functions of an input unit and an output unit, such as a touch panel in which a touch sensor and a display are integrated. A communication unit 14 is an interface that allows the controller 11 to communicate with an external apparatus.
The information processing apparatus 1 includes the controller 11 which comprehensively controls the components of the information processing apparatus 1; the storage unit 12 which stores various kinds of data to be used by an analysis execution unit 110; the output unit 13; the communication unit 14; and the input unit 17. The controller 11 includes the analysis execution unit 110 and a management unit 116. Further, the analysis execution unit 110 includes a sequence data reading unit 111, an information selection unit 112, a data adjustment unit 113, a mutation identification unit 114, a quality-control unit 117, a drug search unit 118, and a report creation unit 115. The storage unit 12 stores a gene-panel-related information database 121, a reference sequence database 122, a mutation database 123, a drug database 124, and an analysis record log 151.
The information processing apparatus 1 creates a report that includes an analysis result corresponding to the gene panel having been used, even when a different gene panel is used for each analysis. The operator who uses the gene analysis system 100 can analyze the result of the panel test by a common analysis program irrespective of the type of the gene panel, and can create a report. Accordingly, when a panel test is performed, a bothersome operation, such as selecting an analysis program to be used for each gene panel and performing specific setting for the analysis program for each gene panel to be used, is omitted. Thus, convenience for the operator is improved.
When the operator of the information processing apparatus 1 has inputted gene panel information through the input unit 17, the information selection unit 112 refers to the gene-panel-related information database 121, and controls the algorithm of the analysis program such that the analysis program performs analysis of genes to be analyzed, in accordance with the inputted gene panel information.
Here, the gene panel information may be any information that can specify the gene panel that has been used in measurement performed by the sequencer 2. Examples of the gene panel information include the gene panel name, the names of genes to be analyzed with the gene panel, the gene panel ID, and the like.
The sequence data reading unit 111 obtains sequence information generated by the sequencer 2. When the information processing apparatus 1 does not have the function of the auxiliary apparatus 2a shown in
On the basis of the gene panel information inputted through the input unit 17, the information selection unit 112 changes the analysis algorithm for performing analysis so as to correspond to the genes to be analyzed with the gene panel indicated by the gene panel information.
The information selection unit 112 outputs an instruction based on the gene panel information, to at least one of the data adjustment unit 113, the mutation identification unit 114, the drug search unit 118, and the report creation unit 115. Through this configuration, the information processing apparatus 1 can output a result of analyzing the sequence information, on the basis of the inputted gene panel information.
That is, the information selection unit 112 is a function block that performs control so as to obtain gene panel information of a gene panel that includes a plurality of genes to be analyzed, and cause the output unit 13 to output the result of analyzing the sequence information on the basis of the obtained gene panel information.
When genes contained in various samples are analyzed in the test institution 120 which performs panel tests, various gene panels are used in accordance with the gene groups to be analyzed of the respective samples.
Even when various combinations of genes to be analyzed have been analyzed by use of various gene panels, the information processing apparatus 1 can appropriately output results of analyzing sequence information because the information processing apparatus 1 is provided with the information selection unit 112.
That is, the operator merely selects gene panel information, and a result of analysis of each piece of sequence information is appropriately outputted. The operator need not set an analysis program to be used in analysis of the sequence information or perform analysis separately for each gene to be analyzed.
For example, when the information selection unit 112 outputs, to the data adjustment unit 113, an instruction based on the gene panel information, the data adjustment unit 113 performs an alignment process or the like reflecting the gene panel information.
In accordance with the gene panel information, the information selection unit 112 issues an instruction so that the reference sequence (reference sequence in which wild-type genome sequences and mutation sequences are incorporated) to be used by the data adjustment unit 113 when mapping the sequence information is limited only to the reference sequence for the genes that correspond to the gene panel information.
In this case, since the gene panel information has already been reflected in the result of the process performed by the data adjustment unit 113, the information selection unit 112 need not output an instruction based on the gene panel information to the mutation identification unit 114 which subsequently performs a process following the process performed by the data adjustment unit 113.
For example, in a case where the information selection unit 112 outputs an instruction based on the gene panel information to the mutation identification unit 114, the mutation identification unit 114 performs a process reflecting the gene panel information.
For example, in accordance with the gene panel information, the information selection unit 112 issues an instruction so that the region of the mutation database 123 referred to by the mutation identification unit 114 is limited to only mutations related to the genes that correspond to the gene panel information. Accordingly, the gene panel information is reflected in the result of the process performed by the mutation identification unit 114.
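The restriction described above can be sketched as a simple filter. The dictionary layouts, the gene panel ID "GP-A", and the function names are hypothetical, and the gene and mutation entries are illustrative only; the description does not define how the gene-panel-related information database 121 or the mutation database 123 is laid out. The same filter could equally be applied to entries of the reference sequence database 122.

```python
def genes_for_panel(panel_db, gene_panel_id):
    """Return the set of gene names analyzed with the given gene panel."""
    return set(panel_db[gene_panel_id])


def restrict_to_panel(entries, panel_genes):
    """Keep only the database entries whose gene belongs to the panel."""
    return [entry for entry in entries if entry["gene"] in panel_genes]


# Hypothetical stand-ins for databases 121 and 123.
panel_db = {"GP-A": ["EGFR", "KRAS"]}
mutation_db = [
    {"gene": "EGFR", "mutation": "L858R"},
    {"gene": "BRAF", "mutation": "V600E"},
    {"gene": "KRAS", "mutation": "G12D"},
]

panel_genes = genes_for_panel(panel_db, "GP-A")
referenced = restrict_to_panel(mutation_db, panel_genes)  # BRAF entry is excluded
```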
Here, a process for receiving an input of gene panel information shown in step S107 of
Here, an example configuration is described in which the controller 11 causes the input unit 17 to display a GUI for inputting gene panel information, thereby allowing the operator to input gene panel information. In this example, the input unit 17 is provided with a touch panel that allows the operator to perform an input operation onto the presented GUI.
First, the controller 11 of the information processing apparatus 1 causes the input unit 17 to display a GUI for allowing the operator to select gene panel information. On the basis of the input operation onto the GUI by the operator, the gene panel information is obtained (step S201).
On the basis of information selected by the operator in the information displayed as the GUI, the information selection unit 112 searches the gene-panel-related information database 121 and reads gene panel information that corresponds to the selected information.
In addition, the information processing apparatus 1 reads gene panel information that is included in the analysis request received from the medical institution 210.
When a gene panel corresponding to the selected information is already registered in the gene-panel-related information database 121 (YES in step S202), and the gene panel matches the gene panel included in the analysis request received from the medical institution 210 (YES in step S203), the information selection unit 112 receives the input. Then, the information selection unit 112 causes the input unit 17 to display a message to the effect that the inputted gene panel can be used (step S204).
Meanwhile, when the gene panel corresponding to the selected information is not registered in the gene-panel-related information database 121, i.e., when an unregistered gene panel has been selected (NO in step S202), the information selection unit 112 causes the input unit 17 to display a message to the effect that the inputted gene panel cannot be used (step S205), and prohibits analysis from being performed by the information processing apparatus 1.
In this case, instead of the message to the effect that the gene panel cannot be used, a message that indicates an error may be displayed. The message may be, for example, “The selected gene panel is not registered.” and may further include a message that urges re-input, such as “Please input gene panel information again”.
When the gene panel corresponding to the selected information does not match the gene panel included in the analysis request received from the medical institution 210 (NO in step S203), the information selection unit 112 causes the input unit 17 to display a message to the effect that the inputted gene panel cannot be used (step S205), and prohibits analysis from being performed by the information processing apparatus 1.
Also in this case, instead of the message that the gene panel cannot be used, a message that indicates an error may be displayed. The message may be, for example, “The selected gene panel is different from that in the order.” and may further include a message that urges re-input, such as “Please input gene panel information again”.
This process can prevent performing sequencing by use of an inappropriate gene panel and performing unnecessary analysis operation, and can eliminate wasteful use of gene panels and wasteful operation of the gene analysis system 100.
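The checks in steps S202 to S205 can be summarized in a short sketch. The function name, the return convention, and the gene panel names are hypothetical; only the order of the checks and the example messages follow the description above.

```python
def check_gene_panel(selected_panel, registered_panels, ordered_panel):
    """Return (usable, message) for the gene panel selected by the operator."""
    # Step S202: is the selected panel registered in database 121?
    if selected_panel not in registered_panels:
        # Step S205: the panel cannot be used, and analysis is prohibited.
        return False, ("The selected gene panel is not registered. "
                       "Please input gene panel information again.")
    # Step S203: does it match the panel in the analysis request?
    if selected_panel != ordered_panel:
        # Step S205: the panel cannot be used, and analysis is prohibited.
        return False, ("The selected gene panel is different from that in the order. "
                       "Please input gene panel information again.")
    # Step S204: the input is received.
    return True, "The inputted gene panel can be used."


registered = {"Panel A", "Panel B"}
ok, message = check_gene_panel("Panel A", registered, ordered_panel="Panel A")
unregistered = check_gene_panel("Panel X", registered, ordered_panel="Panel A")
mismatch = check_gene_panel("Panel B", registered, ordered_panel="Panel A")
```

In this sketch, the caller would start analysis only when the first element of the returned pair is true.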
Next, a GUI for allowing the operator to input gene panel information is described with reference to
As shown in
The list of gene panel names on the GUI is displayed on the basis of gene panel names of gene panels that are provided with gene panel IDs and that are already registered in the gene-panel-related information database 121.
In the GUI shown in
Next, data stored in the gene-panel-related information database 121 referred to by the information selection unit 112 when gene panel information has been inputted through the input unit 17 is described with reference to
In the gene-panel-related information database 121, as shown in data 121A in
In the gene-panel-related information database 121, as shown in data 121B in
As shown in
As shown in
When performing a panel test using a gene panel that allows batch analysis of a plurality of abnormalities being present in a plurality of genes and related to a plurality of diseases, the disease to which each sample is related may be inputted. For example, as shown in
As shown in
Here, update of information stored in the gene-panel-related information database 121 is described with reference to
Update of the information stored in the gene-panel-related information database 121 can be performed by use of an update patch provided from the analysis system management institution 130 to the test institution 120.
Provision of the update patch from the analysis system management institution 130 may be targeted to test institutions 120 that have paid the system usage fee. For example, the analysis system management institution 130 may notify each test institution 120 that an update patch is provided on condition that an update patch that can be provided exists and that the system usage fee has been paid. Such a notification can appropriately urge each test institution 120 to pay the system usage fee.
As shown in
When a “register” button is pressed after the file name has been inputted, a request for updating the information regarding the genes that correspond to the gene names included in the file is associated with the test institution ID, and is transmitted to the management server 3 via the communication unit 14. The generation of the update request and the association of the update request with the test institution ID may be performed by the controller 11 shown in
The analysis system management institution 130 permits the information processing apparatus 1 to download information that includes the gene IDs provided to the gene names included in the update request received by the management server 3; and the gene panel ID provided to the gene panel for analyzing the genes.
Alternatively, as shown in
When a “register” button is pressed after the gene name has been inputted, a request for updating the information regarding the gene that corresponds to the gene name is associated with the test institution ID, and is transmitted to the management server 3 via the communication unit 14. The analysis system management institution 130 permits the information processing apparatus 1 to download information that includes the gene ID provided to the gene name included in the update request received by the management server 3; and the gene panel ID provided to the gene panel for analyzing the gene.
The column for inputting a “registration file name” in
For example, information of input candidates to be displayed is provided from the management server 3 to the information processing apparatus 1 in advance, and is stored in the storage unit 12. Then, when a click operation onto the GUI in the input column has been detected, all of the gene names that can be updated may be presented as input candidates to allow the operator to select therefrom, or a gene name that can be updated and that matches the character string inputted by the operator may be presented as an input candidate. Alternatively, for example, at the time point when the operator has inputted one character “E” in the column for inputting a “gene name” shown in
The gene-panel-related information database 121 may store each gene name, the gene ID of the gene, and the name of a protein coded by the gene in association with one another.
In this case, even when the inputted character string is not a gene name but a protein or the like coded by the gene, the information selection unit 112 can obtain a gene name and a gene ID that are associated with the inputted protein name, with reference to the gene-panel-related information database 121.
When a protein name has been inputted in the column for inputting a “gene name” and the register button has been pressed, a GUI may be displayed that shows a gene name associated with the protein name to allow the operator to confirm that the displayed gene name is the correct one.
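The lookup by gene name or protein name, and the presentation of input candidates, can be sketched as follows. The row layout, the gene IDs, and the function names are hypothetical, and prefix matching is only one assumed interpretation of a gene name that “matches the character string inputted by the operator”.

```python
def resolve_gene(query, rows):
    """Return (gene_name, gene_id) for a gene name or protein name, or None."""
    normalized = query.strip().upper()
    for row in rows:
        if normalized in (row["gene_name"].upper(), row["protein_name"].upper()):
            return row["gene_name"], row["gene_id"]
    return None


def input_candidates(typed, updatable_gene_names):
    """Return updatable gene names that start with the typed characters."""
    prefix = typed.upper()
    return sorted(n for n in updatable_gene_names if n.upper().startswith(prefix))


# Hypothetical rows of the gene-panel-related information database 121.
rows = [
    {"gene_name": "EGFR", "gene_id": "G001",
     "protein_name": "Epidermal growth factor receptor"},
    {"gene_name": "BRAF", "gene_id": "G002",
     "protein_name": "Serine/threonine-protein kinase B-raf"},
]

by_gene = resolve_gene("egfr", rows)
by_protein = resolve_gene("Epidermal growth factor receptor", rows)
suggestions = input_candidates("E", ["EGFR", "BRAF", "ESR1"])
```

When the query is a protein name, the returned gene name can be shown to the operator for confirmation before the update request is transmitted.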
The management unit 116 stores, in the analysis record log 151, whenever necessary, an analysis record which includes the number of times of operation performed by the analysis execution unit 110, the number of analyzed genes, the total number of identified mutations, and the like, in association with the gene panel IDs and the gene IDs. At a desired frequency (for example, each day, each week, or each month), the management unit 116 reads data including the analysis record and the like from the analysis record log 151, and transmits the data in association with the test institution ID, to the management server 3 via the communication unit 14.
The communication unit 14 allows the information processing apparatus 1 to communicate with the management server 3 via the communication line 4. Data transmitted from the communication unit 14 to the management server 3 can include the test institution ID, gene panel IDs, gene IDs, analysis records, update requests, and the like. Data received from the management server 3 can include gene panel information, gene names that can be updated, and the like.
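The aggregation of analysis records and their association with the test institution ID can be sketched as follows. The field names, the identifiers, and the use of JSON as the message format are assumptions made for illustration; the description does not specify the format of the data exchanged with the management server 3.

```python
import json
from collections import Counter


def build_record_payload(test_institution_id, log_entries):
    """Aggregate analysis record log entries into one message for the management server."""
    genes_analyzed = Counter()
    total_mutations = 0
    for entry in log_entries:
        genes_analyzed.update(entry["gene_ids"])
        total_mutations += entry["mutation_count"]
    payload = {
        "test_institution_id": test_institution_id,
        "operation_count": len(log_entries),       # one log entry per panel test
        "genes_analyzed": dict(genes_analyzed),    # gene ID -> number of analyses
        "total_mutations": total_mutations,
    }
    return json.dumps(payload, sort_keys=True)


# Hypothetical entries of the analysis record log 151.
log = [
    {"gene_panel_id": "GP-A", "gene_ids": ["G001", "G002"], "mutation_count": 3},
    {"gene_panel_id": "GP-A", "gene_ids": ["G001"], "mutation_count": 1},
]
message = build_record_payload("INST-P", log)
```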
The flow of a process for analyzing base sequences of samples is described with reference to
First, in step S31 in
Next, in step S32, base sequences of genes of the sample and nucleic acid contained in the quality control sample which have been subjected to the pretreatment are read by the sequencer 2.
Specifically, step S32 is a step of reading the base sequences of one or a plurality of fragmented genes, to be analyzed, which have been collected after the pretreatment. The sequence information includes the gene base sequences having been read in this step. One or a plurality of fragmented nucleic acids, to be analyzed, which have been collected after the pretreatment may also be referred to as a “library”.
Subsequently, in step S33, the information processing apparatus 1 analyzes each gene base sequence having been read, and specifies the presence or absence of mutation in the sequence, the position of the mutation, the type of the mutation, and the like. Through this analysis of the read gene base sequences, each detected gene mutation is identified.
Next, when the quality control sample has been measured, the quality-control unit 117 generates, in step S34, a quality evaluation index for evaluating the quality of the panel test. The information processing apparatus 1 may evaluate the quality of the panel test having been performed, on the basis of the generated quality evaluation index.
Lastly, the information processing apparatus 1 creates a report that includes an analysis result such as information related to the gene mutation identified in step S33, and information indicating the quality of the panel test, such as the quality evaluation index generated by the quality-control unit 117 in step S34. The created report is provided to the medical institution 210.
The type of the sequencer 2 that can be used in the present embodiment is not limited in particular, and any sequencer that can analyze a plurality of targets to be analyzed in one run can be suitably used. In the following, one example is described in which a sequencer of Illumina, Inc. (San Diego, Calif.) (for example, MiSeq, HiSeq, NextSeq, or the like), or an apparatus that employs a similar method to that of the sequencer of Illumina, Inc. is used.
Through combination of a Bridge PCR method and a Sequencing-by-synthesis technique, the sequencer of Illumina, Inc. can perform sequencing while a target DNA is amplified into a very large number of copies and synthesized on a flow cell. The sequencer of Illumina, Inc. can simultaneously analyze base sequences of genes of a plurality of subjects.
(a. Pretreatment)
Next, the procedure of the pretreatment in step S31 in
When base sequences of each of a sample and the quality control sample are to be analyzed, DNA is firstly extracted from the sample that includes genes to be analyzed and the quality control sample that corresponds to the gene panel to be used (step S300 in
In this case, the DNA derived from the sample and the DNA derived from the quality control sample are each subjected to the processes of step S301 and the subsequent steps.
Since the DNA extracted from the quality control sample is subjected to the same process as that for the DNA extracted from the sample, a quality evaluation index useful for evaluating the quality of the sequence analysis in the panel test can be generated.
The usage of the quality control sample is not limited thereto. For example, as shown in
Alternatively, as shown in
By comparison between a result of analysis of DNA derived from the quality control sample that includes mutation and a result of analysis of DNA derived from the quality control sample that does not include mutation, a quality evaluation index useful for evaluating the quality of the sequence analysis in the panel test can be generated.
Furthermore, as shown in
The sample that includes genes to be analyzed may be a combination of a blood sample and a tissue (for example, tumor cell) sample. In this case, for one subject, a sample that contains nucleic acid extracted from the tissue, and a sample that contains nucleic acid extracted from the blood are subjected to sequencing as individual samples.
In the processes of step S301 and the subsequent steps, DNA derived from the sample and DNA derived from the quality control sample may be mixed to perform the processes of step S301 and the subsequent steps without individually processing the DNA derived from the sample and the DNA derived from the quality control sample. Accordingly, in all the processes of step S301 and the subsequent steps, the conditions for both of the samples are the same, and thus, a more accurate quality evaluation index can be generated. In addition, it is not necessary to use a part of the lanes in the flow cell used for the sequencer 2, only for the DNA fragments prepared from the quality control sample. Accordingly, the limited number of lanes can be effectively used for DNA fragments derived from the sample that include genes to be analyzed.
In this case, (1) a reagent for appropriately fragmenting a standard gene which is a gene included in the quality control sample and each gene to be analyzed in the panel test, to prepare a library, and (2) a reagent that contains RNA baits for appropriately capturing the respective DNA fragments after the standard gene included in the quality control sample and the gene to be analyzed in the panel test have been fragmented, are preferably used.
In one embodiment, the quality control sample is a composition containing a plurality of standard genes. The quality control sample can be prepared by mixing a plurality of standard genes. A reagent obtained by these standard genes being mixed and stored in a single container can be provided as the quality control sample to the test institution 120. A plurality of standard genes that are stored in separate containers may be provided in the form of a kit as the quality control sample, to the test institution 120. The quality control sample may be in the form of a solution or may be in a solid (powder) state. When the quality control sample is provided in the form of a solution, an aqueous solvent, such as water or TE buffer, known to a person skilled in the art, can be used as the solvent.
The quality control sample is described with reference to
A quality control sample A1 corresponding to a gene panel A includes at least two of a standard gene that includes SNV, a standard gene that includes Insertion, a standard gene that includes Deletion, a standard gene that includes CNV, and a standard gene that includes Fusion. For example, the quality control sample A1 includes, as the standard gene, a partial sequence of gene A that includes “SNV” with respect to a wild type, and a partial sequence of gene B that includes “Insertion” with respect to a wild type.
A first standard gene and a second standard gene included in the quality control sample may be different DNA molecules, or may be connected to each other. When the first standard gene and the second standard gene are connected to each other, the sequence of the first standard gene and the sequence of the second standard gene may be directly connected to each other, or a spacer sequence may intervene between the sequence of the first standard gene and the sequence of the second standard gene.
The spacer sequence is preferably a sequence that is less likely to be included in the sample subjected to the gene test. For example, the spacer sequence can be a sequence in which only a plurality (for example, 100) of adenine bases are consecutive.
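The connection of two standard genes through a spacer sequence can be sketched as follows. The function name and the short placeholder sequences are hypothetical; the spacer of 100 consecutive adenine bases follows the example given above, and a spacer length of zero corresponds to direct connection.

```python
def join_standard_genes(first_gene, second_gene, spacer_length=100):
    """Connect two standard gene sequences with a spacer of consecutive adenines."""
    spacer = "A" * spacer_length
    return first_gene + spacer + second_gene


# Placeholder sequences, not real standard genes.
first = "GATTACAGATTACA"
second = "CCGGTTCCGGTT"

connected = join_standard_genes(first, second)                 # with a spacer
direct = join_standard_genes(first, second, spacer_length=0)   # directly connected
```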
The standard gene may be a gene that is included in the gene panel to be analyzed, or a gene that is not included in the gene panel to be analyzed. The standard gene may be a gene of a biological species for which the gene test is performed, or a gene of a different biological species. For example, when the gene test is performed for a human, the standard gene can be a gene of an animal other than a human, a plant, a bacterium, or the like.
The method for synthesizing the standard gene is not limited in particular. For example, the standard gene can be synthesized by a known DNA synthesizer. Alternatively, a gene derived from an organism, which serves as a template, is amplified by PCR and purified, whereby the standard gene may be obtained. Alternatively, PCR amplification is performed by using, as a template, a standard gene synthesized by a DNA synthesizer and purification is performed, whereby the standard gene may be obtained.
The length of the standard gene is not limited in particular. For example, the length of the standard gene can be 50 nucleotides or greater. In the case of amplification by PCR, amplification can be advantageously performed with ease if the length of the standard gene is 2000 nucleotides or less. When the standard gene is synthesized by a DNA synthesizer, up to several kbp of the standard gene can be synthesized.
The concentration of the standard gene in the quality control sample is not limited in particular. For example, the concentration of the standard gene can be approximately the same as a DNA concentration in the sample.
The standard gene in the quality control sample may be single-stranded or double-stranded. The standard gene may be linear or circular.
For example, (1) a standard gene that includes substitution mutation is prepared, (2) a standard gene that includes fusion mutation is prepared, and (3) the quality control sample and the sample are mixed together, whereby a sequence analysis sample is prepared. Next, (4) the standard genes and the sample-derived genomic DNA in the sequence analysis sample are subjected to the pretreatment (fragmentation, DNA concentration, PCR amplification using tag primer, and the like) and the sequence analysis, to obtain sequence information of the target gene. In the sequence analysis, an index for quality control is obtained, and the quality of the result of analysis of the target gene is evaluated on the basis of the index of sequence analysis of the standard DNA molecules. The operator is allowed to determine reliability of the result of analysis of the gene to be analyzed, on the basis of the result of the quality evaluation.
In the example above, in (3), the quality control sample and the subject-derived sample are mixed together, but are not limited thereto. For example, the quality control sample and the sample may be separately subjected to the sequence analysis in (4) without being mixed together.
When the panel test using the same gene panel is repeatedly performed, the same quality control sample may be repeatedly used. As shown in data 121D in
If a plurality of quality control samples having different combinations of standard genes are selectively used for each panel test, each week, or each month, the quality-control unit 117 can generate the quality evaluation index for evaluating the quality of the process for detecting mutations in the panel test, on the basis of detection of mutations in an increased number of kinds of standard genes. Therefore, the comprehensiveness of the quality control of the panel test is improved.
For example,
Next, as shown in
Next, as shown in
The adapter sequence is a sequence to be used for performing sequencing in a later step. According to one embodiment, in Bridge PCR, the adapter sequence can be a sequence that is hybridized with oligo DNA which is the capture molecule immobilized on the flow cell.
In one aspect, as shown in the upper part of
The adapter sequence may be added to the DNA fragment by using a known technique in this technical field. For example, the adapter sequence may be added by subjecting the DNA fragment to PCR reaction using a PCR primer that includes the adapter sequence and a sequence of the gene to be analyzed. Alternatively, the DNA fragment may be blunted and the adapter sequence may be ligated.
Next, as shown in
The biotinylated RNA bait library is composed of biotinylated RNAs (hereinafter, referred to as RNA bait) that are hybridized with genes to be analyzed. The RNA bait may have any length. For example, long oligo RNA bait having about 120 nucleotides may be used in order to enhance specificity.
The panel test using the sequencer 2 in the present embodiment may be a test in which a specific gene is to be analyzed, or may be a test in which a large number of genes (for example, 100 or greater) are to be analyzed.
The reagent to be used in the panel test includes a set of RNA baits that respectively correspond to the large number of genes. When the panel is different, the number and the kinds of genes to be tested are different, and thus, the set of RNA baits included in the reagent to be used in the panel test is also different. When a gene different from a gene to be analyzed is used as a standard gene, a bait that binds to the standard gene needs to be prepared.
As shown in
Accordingly, as shown in the middle part of
Accordingly, the DNA fragments hybridized with the RNA baits, i.e., the DNA fragments to be analyzed, can be selectively collected and concentrated. This process is performed for each sample, whereby the library of each sample is prepared (see step I in
In a case where base sequences of genes of a plurality of subjects are to be simultaneously analyzed, the measurement sample to be applied to the flow cell is prepared by mixing libraries of a plurality of samples (see step II in
In order to enable sorting of base sequences for each subject or for each sample from the sequence information of DNA of samples derived from a plurality of subjects, an index sequence that is different for each library is added.
In step S304a in
Accordingly, pieces of sequence information of base sequences regarding genes of samples derived from different subjects can be distinguished from one another on the basis of the base sequences of the index sequences added thereto. If no index sequence is added to nucleic acid that is not to be analyzed (for example, non-subject-derived genes, genes derived from the quality control sample, and the like), only the sequence information of base sequences of a subject-derived sample can be made the target to be analyzed.
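The sorting of sequence information by index sequence can be sketched as follows. The representation of each read as a pair of an index sequence and a fragment sequence, the exact-match rule, and the 8-base index sequences are assumptions made for illustration; the handling of sequencing errors in the index is not addressed. Reads whose index sequence is not associated with any sample are excluded from the analysis target.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample):
    """Group fragment sequences by the sample associated with their index sequence."""
    by_sample = defaultdict(list)
    for index_seq, fragment_seq in reads:
        sample_id = index_to_sample.get(index_seq)
        if sample_id is None:
            continue  # unknown or missing index: not a target of analysis
        by_sample[sample_id].append(fragment_seq)
    return dict(by_sample)


# Hypothetical index sequences for the libraries of two samples.
index_to_sample = {"ATCACGAT": "sample A", "CGATGTCA": "sample B"}
reads = [
    ("ATCACGAT", "GATTACA"),
    ("CGATGTCA", "TTGACCA"),
    ("ATCACGAT", "CCATGGA"),
    ("", "ACGTACG"),  # read without an index sequence
]
sorted_reads = demultiplex(reads, index_to_sample)
```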
The index sequence can be added to the DNA fragments by use of a known technique in this technical field. For example, in a case where SureSelect XT of Agilent is used, if the DNA fragments collected in step S304 in
Alternatively, the index sequence may be added when the adapter sequence is added to each DNA fragment. For example, the index sequence may be added by subjecting the DNA fragments to PCR reaction using a PCR primer that includes the adapter sequence, the index sequence, and a sequence of the gene to be analyzed.
Next, in step S304b in
In preparation of a measurement sample, a measurement sample sheet in which a sample ID is associated with an index sequence ID and an index sequence added to the library of each sample is created and managed.
The measurement sample sheet may include setting information which is common among the libraries of all the samples included in the measurement sample, and sample information specific to the library of each sample included in the measurement sample. As shown in
The setting information may include “sample gene” which is information related to the method for preparing the libraries of samples used in preparation of a measurement sample. In the column “sample gene”, “PCR product”, “amplicon”, or the like can be entered, for example.
Further, the setting information may include “read sequence length” which is the set value of the length of the base sequence read by the sequencer 2, information related to the adapter 1 sequence and the adapter 2 sequence, and the like. Here, the read sequence is the base sequence read through sequencing by the sequencer 2.
As shown in
As for the measurement sample sheet, the sequencer 2 or the auxiliary apparatus 2a shown in
Here, a method for preparing a sample is described with reference to
Accordingly, the measurement sample is prepared so as to contain a previously determined amount of nucleic acid, by mixing the libraries prepared from the recommended number of subject-derived samples. Here, the “previously determined amount of nucleic acid” means the amount of nucleic acid recommended in accordance with the specification of the flow cell that suits the sequencer 2 and the amounts of the primer, the probe, and the like included in the gene panel. Here, the amount of nucleic acid is the number of moles of nucleic acid.
The molar concentration of nucleic acid can be calculated on the basis of, for example, the absorbance at 260 nm, the molecular weight of the DNA fragment, and the molar absorption coefficient of nucleic acid. After purifying the PCR product after the PCR reaction for adding the index sequence is performed in step S304a in
For example, in a case where the length of the library obtained as the PCR product is 100 bp, and the concentration is x (ng/μl), if 330 is used as the average molecular weight of deoxyribonucleotide, the molar concentration of the PCR product is calculated as x/33 (pmol/μl). When a previously determined amount (for example, y (pmol)) of the nucleic acid of this library is to be mixed, a volume of 33×y/x (μl) is used to prepare the measurement sample, with use of an autopipette or the like.
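The conversion described above can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the function names are hypothetical, and the calculation assumes, as the text does, an average molecular weight of 330 per deoxyribonucleotide.

```python
AVG_NT_MW = 330.0  # g/mol per deoxyribonucleotide (value used in the text)


def molar_conc_pmol_per_ul(conc_ng_per_ul: float, length_nt: int) -> float:
    """Convert a mass concentration (ng/ul) into a molar concentration (pmol/ul)."""
    if length_nt <= 0:
        raise ValueError("length_nt must be positive")
    molecular_weight = length_nt * AVG_NT_MW  # g/mol of the whole fragment
    # ng -> g is a factor of 1e-9 and mol -> pmol is a factor of 1e12,
    # so the net factor is 1e3.
    return conc_ng_per_ul * 1e3 / molecular_weight


def volume_for_amount_ul(target_pmol: float, conc_pmol_per_ul: float) -> float:
    """Volume (ul) of library that contains target_pmol of nucleic acid."""
    if conc_pmol_per_ul <= 0:
        raise ValueError("concentration must be positive")
    return target_pmol / conc_pmol_per_ul


# Hypothetical values: a 100-base library at x = 6.6 ng/ul, y = 1.0 pmol wanted.
conc = molar_conc_pmol_per_ul(6.6, 100)   # equals x/33
volume = volume_for_amount_ul(1.0, conc)  # equals 33*y/x
```

With a 100-base library, the first function reduces to x/33 (pmol/μl) and the second to 33×y/x (μl), which matches the worked example in the text.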
In a case where the number of subject-derived samples to be analyzed is insufficient, even if libraries prepared from the subject-derived samples are mixed by the same amount as the amount that is used when the number of subject-derived samples to be analyzed is the recommended number of samples, the amount of nucleic acid of the measurement sample does not become the previously determined amount of nucleic acid. However, if, in order to attain the previously determined amount of nucleic acid, the amount of libraries prepared from the subject-derived samples is increased or decreased to prepare a measurement sample, the data amount of sequence information obtained per sample will vary for each sequencing run.
Therefore, even when the number of subject-derived samples to be analyzed is insufficient, it is preferable that the libraries prepared from the subject-derived samples are mixed by the same amount as the amount that is used when the number of subject-derived samples to be analyzed is the recommended number of samples, while the amount of nucleic acid of the measurement sample is made the previously determined amount of nucleic acid. Such a method for preparing the measurement sample is described with reference to
In this case, the amount of the non-subject-derived nucleic acid included in the measurement sample may be an amount equal to or greater than the amount of nucleic acid per sample included in the measurement sample. Examples of the non-subject-derived nucleic acid include, but are not limited to, “PhiX DNA” or the like provided from Illumina, Inc. For example, nucleic acid or the like, of a quality control sample for a gene panel, which has an adapter sequence added thereto may be used. In order not to hinder the reading of base sequences in the sequencer 2, it is preferable to use high diversity nucleic acid (i.e., nucleic acid having high diversity in sequences) in which nucleic acids having diverse base sequences are mixed, rather than low diversity nucleic acid (i.e., nucleic acid having low diversity in sequences) in which a large amount of nucleic acids having the same base sequences is included.
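As an illustration of this preparation, the following sketch computes how much non-subject-derived nucleic acid would fill the shortfall. It assumes, as a simplification not stated in the text, that the previously determined amount is shared equally among the recommended number of samples; the function name and the numbers are hypothetical.

```python
from typing import Tuple


def filler_amount_pmol(
    target_total_pmol: float, recommended_samples: int, actual_samples: int
) -> Tuple[float, float]:
    """Return (amount per sample, amount of non-subject-derived nucleic acid).

    Assumption: the previously determined total is divided equally among
    the recommended number of samples.
    """
    if not 0 < actual_samples <= recommended_samples:
        raise ValueError("actual_samples must be in 1..recommended_samples")
    per_sample = target_total_pmol / recommended_samples
    # Each subject-derived library is mixed in the usual per-sample amount;
    # the remainder is made up with non-subject-derived nucleic acid.
    filler = target_total_pmol - per_sample * actual_samples
    return per_sample, filler


per_sample, filler = filler_amount_pmol(16.0, 16, 12)
```

Under this assumption, whenever fewer samples than recommended are available, the filler amount is at least the per-sample amount, which is consistent with the condition stated above.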
Alternatively, in a case where the number of subject-derived samples to be analyzed is smaller than the recommended number of samples, it is sufficient that libraries having been prepared from subject-derived samples and having already been analyzed (i.e., those not to be analyzed any more) are used as a substitute, to prepare a measurement sample. As the libraries prepared from subject-derived samples having already been analyzed, libraries each having added thereto an index sequence different from any of the index sequences added to libraries that have been prepared from subject-derived samples to be analyzed and that are to be mixed in order to prepare a measurement sample, are used.
For example, in a case where a first index sequence is added to a library prepared from a subject-derived sample to be analyzed, it is sufficient that a library prepared from a subject-derived sample that has been analyzed and that has added thereto a second index sequence different from the first index sequence is used to prepare a measurement sample.
Also in this case, the variation in the amount of nucleic acid derived from each sample included in the measurement sample is preferably in a range of ±10%. The amount of the already-analyzed subject-derived nucleic acid included in the measurement sample may be an amount equal to or greater than the amount of nucleic acid per sample included in the measurement sample.
When the measurement sample is prepared according to the methods shown in
Next, with reference to
As shown in the left part to the center part of
Although
Next, as shown in the right part of
Subsequently, as shown in
That is, each DNA fragment to be analyzed (for example, Template DNA in
On the flow cell, the adapter 2 sequence on the 3′ end side is immobilized in advance, and the adapter 2 sequence on the 3′ end side of the DNA fragment is bound to the adapter 2 sequence on the 3′ end side on the flow cell to produce a bridge-like state, whereby a bridge is formed (“3” in
When DNA elongation is caused by DNA polymerase in this state (“4” in
Through repetition of the bridge formation, the DNA elongation, and the denaturation in this order, a large number of single-stranded DNA fragments are locally amplified and immobilized, whereby clusters can be formed (“6” to “9” in
Then, as shown in
First, to the single-stranded DNA immobilized on the flow cell (the upper left part of
The sequence primer may be any sequence primer that is designed so as to be hybridized to a part of the adapter sequence, for example. In other words, it is sufficient that the sequence primer is designed to amplify the DNA fragment derived from the sample DNA. In a case where an index sequence is added, it is sufficient that the sequence primer is designed to further amplify the index sequence.
After the sequence primer is added, the DNA polymerase extends the strand by one base, using dNTP that is labeled with fluorescence and has the 3′ end blocked. Since dNTP having the 3′ end side blocked is used, the polymerase reaction stops when one base elongation has been realized. Then, the DNA polymerase is removed (the right middle part of
In order to determine four kinds of bases, the photographs are taken by a fluorescence microscope for the fluorescent colors respectively corresponding to A, C, G, and T, while a wavelength filter is changed. After all the photographs have been obtained, bases are determined from the photograph data. Then, the fluorescent substance and the protecting group blocking the 3′ end side are removed, and the process proceeds to the next polymerase reaction. With this flow assumed as one cycle, the second cycle, the third cycle, and so on are performed, whereby sequencing of the entire length can be performed.
According to the technique described above, the length of the chain that can be analyzed reaches 150 bases ×2, and analysis in a unit much smaller than the unit of a picotiter plate can be performed. Thus, due to the high density, a huge amount of sequence information of 40 to 200 Gb can be obtained in one analysis.
The gene panel used for reading the read sequences by the sequencer 2 means an analysis kit for analyzing a plurality of targets to be analyzed in one run as described above. In one embodiment, the gene panel can be an analysis kit for analyzing a plurality of gene sequences related to a plurality of diseases.
When used herein, the term “kit” is intended to mean a package that includes containers (for example, bottles, plates, tubes, and dishes) each containing a specific material. Preferably, the kit includes instructions for using each material. When used in the context of a kit herein, “include (is included)” is intended to mean a state of being included in any of individual containers that form a kit. The kit can be a package in which a plurality of different compositions are packed into one, and the forms of the compositions can be as described above. In the case of a solution form, the solution may be contained in a container.
The kit may include a substance A and a substance B that are mixed in one container or that are in separate containers. The “instructions” indicate the procedure of applying each component in the kit to a therapy and/or diagnosis. The “instructions” may be written or printed on paper or any other medium, or may be stored in an electronic medium such as a magnetic tape, a computer readable disk or tape, or a CD-ROM. The kit can include a container that contains a diluent, a solvent, a washing liquid, or another reagent. Further, the kit may also include an apparatus that is necessary for the kit to be applied to a therapy and/or diagnosis.
In one embodiment, the gene panel may be provided with one or more of the quality control sample, reagents such as the reagent for fragmenting nucleic acid, the reagent for ligation, the washing liquid, the PCR reagent (dNTP, DNA polymerase, etc.), and the magnetic beads, as described above. The gene panel may be provided with one or more of oligonucleotides for adding the adapter sequences to the fragmented DNA, oligonucleotides for adding the index sequence to the fragmented DNA, the RNA bait library, and the like.
The index sequence provided to each gene panel can be a sequence that is unique to the gene panel and that identifies the gene panel. The RNA bait library provided to each gene panel can be a library that is unique to the gene panel and that includes RNA baits that correspond to the test genes of the gene panel.
In a case where each piece of information included in the measurement sample sheet shown in
Accordingly, on the basis of the sample information in the measurement sample sheet, the information processing apparatus 1 can selectively analyze only the sequence information of a gene of a subject-derived sample having a predetermined index sequence added thereto, in the entire sequence information obtained from the sequencer 2.
The control performed by the information processing apparatus 1 based on the information of the measurement sample sheet is described with reference to
When a measurement sample is prepared according to the method shown in
For example, the aforementioned PhiX provided from Illumina, Inc. has adapter sequences already ligated thereto, and can be suitably used as non-subject-derived nucleic acid.
When the sequence information is associated with an index sequence in the measurement sample sheet (i.e., sequence information that includes an index sequence) (YES in step S51), the information processing apparatus 1 performs analysis (step S52). When the sequence information is not associated with any index sequence (NO in step S51), the information processing apparatus 1 does not perform at least a part of the analysis that is to be performed on the sequence information associated with an index sequence (step S53). That is, the information processing apparatus 1 selectively performs the processes of step S109 and the subsequent steps shown in
Meanwhile, when a measurement sample is prepared according to the method shown in
When the sequence information is associated with an index sequence in the measurement sample sheet (YES in step S51a), the information processing apparatus 1 advances to step S51b, and when the sequence information is not associated with any index sequence (NO in step S51a), the information processing apparatus 1 advances to step S53.
In step S51b, the information processing apparatus 1 refers to the measurement sample sheet, and when the sequence information is associated with the first index sequence added to nucleic acid prepared from a sample to be analyzed (YES in step S51b), the information processing apparatus 1 performs analysis (step S52). Meanwhile, when the sequence information is not associated with the first index sequence (NO in step S51b), the information processing apparatus 1 does not perform at least a part of the analysis that is to be performed on the sequence information associated with an index sequence (step S53). That is, the information processing apparatus 1 selectively performs the processes of step S109 and the subsequent steps shown in
According to this configuration, the information processing apparatus 1 can efficiently perform the analysis only on the base sequences of the samples to be analyzed.
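The two-stage determination described above can be sketched as follows. The step labels S51a, S51b, S52, and S53 are taken from the text; the layout of the measurement sample sheet (a mapping from index sequence to a sample entry with an `analyze` flag) and all names are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical sample sheet: index sequence -> sample entry.
SAMPLE_SHEET: Dict[str, dict] = {
    "AAAAAAAA": {"sample_id": "S001", "analyze": True},   # first index sequence
    "CCCCCCCC": {"sample_id": "S900", "analyze": False},  # already-analyzed library
}


def decide(read_index: Optional[str], sheet: Dict[str, dict]) -> str:
    """Return "S52" (perform analysis) or "S53" (skip at least part of it)."""
    # Step S51a: is the read associated with any index sequence in the sheet?
    entry = sheet.get(read_index) if read_index is not None else None
    if entry is None:
        return "S53"
    # Step S51b: is that index sequence one added to a sample to be analyzed?
    if entry["analyze"]:
        return "S52"
    return "S53"


decisions = [decide(idx, SAMPLE_SHEET) for idx in ("AAAAAAAA", "CCCCCCCC", None)]
```

A read without an index sequence (for example, one derived from non-subject-derived nucleic acid) and a read carrying the index sequence of an already-analyzed library both end at step S53 in this sketch.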
Next, the processes performed by the sequence data reading unit 111, the data adjustment unit 113, and the mutation identification unit 114 of the analysis execution unit 110 are described on the basis of the flow of the process shown in
First, in step S10 shown in
The sequence information is data indicating base sequences read by the sequencer 2. The sequencer 2 performs sequencing on a large number of nucleic acid fragments obtained by use of a specific gene panel, reads the base sequences thereof, and provides the information processing apparatus 1 with the read sequences as sequence information.
The sequence data reading unit 111 may obtain sequence information read from an exon region of a nucleic acid sequence, or may obtain sequence information read from an exon region of 10 Mb (10 million bases) or more.
Next, in step S11, the sequence data reading unit 111 reads sequence information stored in a file of the sequence information to be analyzed.
In one aspect, the sequence information may include a quality score of each base in the sequence as well as the sequence having been read. Both the sequence information obtained by subjecting, to the sequencer 2, an FFPE sample from a lesion site of a subject and the sequence information obtained by subjecting a blood sample of the subject to the sequencer 2 are inputted to the information processing apparatus 1.
The quality score (Q) is defined by the following equation:
Q = −10 log₁₀ E
In this equation, E represents an estimated value of the probability of incorrect base assignment. The greater the value of Q is, the lower the probability of the error is. The smaller the value of Q is, the greater the portion of the read that cannot be used is.
In addition, as the value of Q becomes smaller, false-positive mutation assignment also increases, which could result in a lowered accuracy of the result. “False-positive” means that the read sequence is determined as having a mutation although the read sequence does not actually have the mutation to be determined.
“Positive” means that the read sequence has true mutation to be determined, and “negative” means that the read sequence does not have mutation to be determined. For example, if the quality score is 20, the probability of error is 1/100. This means that the accuracy (also referred to as “base call accuracy”) of each base in the gene sequence having been read is 99%.
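The relationship between the quality score and the error probability can be checked with a short sketch. The function names are hypothetical. The decoding helper additionally assumes the common FASTQ convention in which each quality character encodes Q as its ASCII code minus 33; that offset is an assumption, not something stated in this description.

```python
import math
from typing import List


def error_probability(q: float) -> float:
    """E for a given quality score, from Q = -10 * log10(E)."""
    return 10 ** (-q / 10)


def quality_score(e: float) -> float:
    """Q for a given estimated error probability E (0 < E <= 1)."""
    if not 0 < e <= 1:
        raise ValueError("E must be in (0, 1]")
    return -10 * math.log10(e)


def decode_quality_string(qual: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string, assuming ASCII offset 33."""
    return [ord(ch) - offset for ch in qual]


accuracy_at_q20 = 1 - error_probability(20)  # base call accuracy for Q = 20
```

For Q = 20 this gives an error probability of 1/100 and a base call accuracy of 99%, in agreement with the example above.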
Subsequently, in step S12 shown in
The data adjustment unit 113 performs alignment for both the sequence information obtained by subjecting an FFPE sample from a lesion site of a subject to the sequencer 2, and the sequence information obtained by subjecting a blood sample of the same subject to the sequencer 2.
The reference sequence information indicates the reference sequence name (reference sequence ID), the sequence length of the reference sequence, and the like in the reference sequence database 122. The read sequence name is information that indicates the name (read sequence ID) of each read sequence for which alignment has been performed. The position information indicates the position (leftmost mapping position) on the reference sequence at which the leftmost base of the read sequence has been mapped. The map quality is information that indicates the quality of mapping corresponding to the read sequence. The sequence is information that indicates the base sequence (example: . . . GTAAGGCACGTCATA . . . ) corresponding to each read sequence.
Further, each reference sequence in the reference sequence database 122 is provided with metadata which indicates gene panel information. For example, the gene panel information provided to each reference sequence can directly or indirectly indicate the gene, to be analyzed, that corresponds to the reference sequence.
In one embodiment, the information selection unit 112 may perform control such that, when the data adjustment unit 113 obtains a reference sequence from the reference sequence database 122, the data adjustment unit 113 refers to the inputted gene panel information and the metadata of each reference sequence, and selects a reference sequence that corresponds to the gene panel information.
For example, in one aspect, the information selection unit 112 may control the data adjustment unit 113 so as to select a reference sequence that corresponds to the gene, to be analyzed, that is specified by the inputted gene panel information. This allows the data adjustment unit 113 to perform mapping only on the reference sequence related to the gene panel having been used, and thus efficiency of the analysis can be improved.
In another embodiment, the information selection unit 112 need not perform the above-described control. In this case, the information selection unit 112 merely controls the mutation identification unit 114 or the report creation unit 115 as described later.
In step S401 shown in
In one aspect, the data adjustment unit 113 calculates a score that indicates the degree of matching between the read sequence and the reference sequence. The score indicating the degree of matching can be, for example, a percentage identity between two sequences. For example, the data adjustment unit 113 specifies positions at which bases of the read sequence and bases of the reference sequence are the same, obtains the number of the positions, and divides the number of the positions at which the bases are the same, by the number of bases (the number of bases in the comparison window) of the read sequence compared with the reference sequence, to calculate the percentage.
In the calculation of the score indicating the degree of matching between a read sequence and a reference sequence, the data adjustment unit 113 may perform calculation such that, when the read sequence includes a predetermined mutation (for example, InDel: Insertion/Deletion) with respect to the reference sequence, a score lower than that calculated in the normal calculation is obtained.
In one aspect, for a read sequence that includes at least one of insertion and deletion with respect to the reference sequence, the data adjustment unit 113 may correct the score by, for example, multiplying the score calculated in the above-described normal calculation, by a weighting factor according to the number of bases that correspond to the insertion/deletion. The weighting factor W may be calculated as, for example, W={1−(1/100)×(the number of bases corresponding to insertion/deletion)}.
The data adjustment unit 113 calculates the score of the degree of matching while changing the mapping position of the read sequence with respect to each reference sequence, thereby specifying a position on the reference sequence at which the degree of matching with the read sequence satisfies a predetermined criterion. At this time, an algorithm known in this technical field, such as dynamic programming, the FASTA method, or the BLAST method, may be used.
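The score calculation and the position search can be illustrated with the sketch below. It is deliberately simplified: it slides the read along the reference without gaps and keeps the best position, which is a stand-in for, not an implementation of, dynamic programming, FASTA, or BLAST. The function names and the 0-based position convention are assumptions.

```python
from typing import Tuple


def percent_identity(read: str, ref_window: str) -> float:
    """Percentage of positions at which the read and the window have the same base."""
    if not read:
        raise ValueError("read must not be empty")
    matches = sum(1 for a, b in zip(read, ref_window) if a == b)
    return 100.0 * matches / len(read)


def indel_weight(n_indel_bases: int) -> float:
    """Weighting factor W = 1 - (1/100) * (number of inserted/deleted bases)."""
    return 1.0 - n_indel_bases / 100.0


def best_ungapped_position(read: str, reference: str) -> Tuple[int, float]:
    """Return (leftmost 0-based position, score) of the best ungapped placement."""
    best_pos, best_score = -1, -1.0
    for pos in range(len(reference) - len(read) + 1):
        score = percent_identity(read, reference[pos:pos + len(read)])
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos, best_score


position, score = best_ungapped_position("GTAAGG", "TTGTAAGGCACGTCATA")
corrected = score * indel_weight(3)  # score if the read had a 3-base InDel
```

A real aligner would also decide where the insertion or deletion lies. Here the number of InDel bases is simply passed in to show how the weighting factor lowers the score.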
With reference back to
When alignment of all the read sequences included in the sequence information obtained by the sequence data reading unit 111 has not been performed (NO in step S405), the data adjustment unit 113 returns to step S401. When alignment of all the read sequences included in the sequence information has been performed (YES in step S405), the data adjustment unit 113 completes the process of step S12.
With reference back to
In step S14 shown in
In one aspect, the mutation identification unit 114 generates a result file on the basis of the extracted gene mutation.
As shown in
In
With reference back to
In the mutation position information, “CHROM” indicates the chromosome number, and “POS” indicates the position on that chromosome. “REF” indicates the base in the wild type, and “ALT” indicates the base that is present after the mutation. “Annotation” indicates information related to mutation. “Annotation” may be information indicating mutation of amino acid such as “EGFR C2573G” or “EGFR L858R”, for example. For example, “EGFR C2573G” indicates mutation in which cysteine at residue 2573 of protein “EGFR” is substituted with glycine.
As in the above-described example, “Annotation” of the mutation information may be information for converting mutation based on the base information to mutation based on amino acid information. In this case, the mutation identification unit 114 can convert the mutation based on the base information to the mutation based on the amino acid information, according to the information of “Annotation” which has been referred to.
The mutation identification unit 114 searches the mutation database 123 by using, as a key, information (for example, base information corresponding to mutation position information and mutation) that specifies the mutation included in the result file. For example, the mutation identification unit 114 may search the mutation database 123 by using, as a key, information of any of “CHROM”, “POS”, “REF”, and “ALT”. When the gene mutation extracted by comparison between the alignment sequence derived from the blood sample and the alignment sequence derived from the lesion site is already registered in the mutation database 123, the mutation identification unit 114 identifies the mutation as a mutation existing in the sample, and provides annotation (for example, “EGFR L858R”, “BRAF V600E”, or the like) to the mutation included in the result file.
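The database search can be sketched as a keyed lookup. In this illustration all four fields together form the key, which is one possible choice. The in-memory dictionary standing in for the mutation database 123, the function name, and the position values are hypothetical placeholders, not real genomic coordinates.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)

# Stand-in for the mutation database 123; POS values are made-up placeholders.
MUTATION_DB: Dict[Key, str] = {
    ("7", 100, "T", "G"): "EGFR L858R",
    ("7", 200, "A", "T"): "BRAF V600E",
}


def annotate(records: List[dict], db: Dict[Key, str]) -> List[dict]:
    """Return the records registered in db, each with its annotation attached."""
    identified = []
    for rec in records:
        key = (rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"])
        annotation = db.get(key)
        if annotation is not None:  # already registered: identify and annotate
            identified.append({**rec, "Annotation": annotation})
    return identified


result_file = [
    {"CHROM": "7", "POS": 100, "REF": "T", "ALT": "G"},
    {"CHROM": "1", "POS": 999, "REF": "C", "ALT": "A"},  # not registered
]
identified = annotate(result_file, MUTATION_DB)
```

Only the first record is returned, because the second is not registered in the stand-in database.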
In one embodiment, before the mutation identification unit 114 searches the mutation database 123 on the basis of the result file, the information selection unit 112 may cause mutations that do not correspond to the gene panel information having been inputted to the mutation identification unit 114, to be masked in (excluded from) the result file.
For example, in one aspect, the mutation identification unit 114 which has been notified of the gene panel information from the information selection unit 112 may refer to a table indicating the correspondence relationship between each gene to be analyzed and position information (for example, “CHROM” and “POS”) as shown in
The flow of a process in which the drug search unit 118 generates a list including information related to drugs is described with reference to
The drug search unit 118 searches the drug database 124 by using, as a key, the mutation ID provided to each gene mutation identified by the mutation identification unit 114 (step S15a). On the basis of the search result, the drug search unit 118 generates a list including information regarding drugs related to the mutations (step S16a). The generated list is incorporated into the report created by the report creation unit 115.
Data 124A stored in the drug database 124 and used when the drug search unit 118 searches the drug database 124 and generates a drug list is described with reference to
As shown in
Each mutation ID in the drug database 124 may be provided with “metadata related to gene-panel-related information”, which is metadata related to gene panel information. The drug search unit 118 refers to the “metadata related to gene-panel-related information” in accordance with an instruction from the information selection unit 112.
The drug search unit 118 changes the range in which the drug database 124 is searched, to a range indicated by the metadata. Accordingly, in accordance with the “metadata related to gene-panel-related information” provided to each drug and the inputted gene panel information, the drug search unit 118 can narrow the drugs that should be referred to in the drug database, and can generate a list that includes information regarding drugs according to the gene panel information.
The drug search unit 118 may search the drug database 124 having a data structure shown in
The drug search unit 118 searches the drug database 124 which stores data 124B shown in
On the basis of the search result, the drug search unit 118 generates a list that includes the mutation, the related drug that corresponds to the mutation, and information regarding the approval of the related drug (step S16b).
The drug search unit 118 may search the drug database 124 having a data structure shown in
The drug search unit 118 searches the drug database 124 which stores the data 124B shown in
When the searched drug has been approved, the drug search unit 118 determines whether or not the disease (disease name or disease ID) of the subject from whom the sample has been collected, and the disease (for example, disease name or disease ID of the “target disease” shown in
When the disease of the subject and the “target disease” match each other, the drug search unit 118 associates the drug of the search result, as an approved drug, with the mutation, and generates a list that includes the mutation, the related drug corresponding to the mutation, information regarding the approval of the related drug, and the like (step S16b).
Meanwhile, when the disease of the subject and the “target disease” are different from each other, the drug search unit 118 determines that the searched related drug is a drug having a possibility of off-label use, associates the determination result with the mutation, and generates a list that includes the mutation, the related drug corresponding to the mutation, information regarding the approval of the related drug, and the like (step S16b).
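The determination in the preceding paragraphs can be sketched as follows. The record layout, the drug and disease names, and the label strings are hypothetical. The branch for a drug that has not been approved is an assumption added only so that the function is complete.

```python
from typing import Dict, List


def classify_drug(entry: Dict, subject_disease: str) -> str:
    """Classify a related drug for the subject's disease."""
    if not entry["approved"]:
        return "unapproved"  # assumption: this case is not detailed in the text above
    if subject_disease in entry["target_diseases"]:
        return "approved drug"
    # Approved, but for a different target disease.
    return "possible off-label use"


def build_list(mutation: str, entries: List[Dict], subject_disease: str) -> List[Dict]:
    """Build list rows: mutation, related drug, and the determination result."""
    return [
        {
            "mutation": mutation,
            "drug": entry["drug"],
            "status": classify_drug(entry, subject_disease),
        }
        for entry in entries
    ]


# Hypothetical database rows for one mutation.
entries = [
    {"drug": "Drug A", "approved": True, "target_diseases": ["Disease X"]},
    {"drug": "Drug B", "approved": True, "target_diseases": ["Disease Y"]},
]
rows = build_list("EGFR L858R", entries, "Disease X")
```

In this example Drug A is listed as an approved drug, and Drug B is listed as having a possibility of off-label use.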
The identification information (for example, disease name, disease ID, or the like) for identifying the disease of the subject can be inputted through the input unit 17 by an operator or the like when performing gene analysis, for example. In this case, the information selection unit 112 obtains information related to the disease corresponding to the sample inputted by the operator, and identifies the disease. Alternatively, as shown in
Alternatively, in the test institution 120, a sample ID and a subject ID are managed so as to be associated with a disease ID, and the information selection unit 112 may obtain a disease ID that corresponds to a sample, on the basis of the subject ID or the sample ID. For example, the information selection unit 112 may obtain, via a communication line, a disease ID associated with a subject ID (or sample ID) obtained by reading a recording means of a label attached to each container which stores a sample. The disease ID may be included in a header region of sequence information shown in
As in the data 124B shown in
In this manner, the drug search unit 118 searches the drug database 124 in which gene mutations, target diseases, and drugs are stored in association with one another, and checks a detected gene mutation against a disease specified by the information selection unit 112, thereby being able to create a list according to the disease corresponding to the sample. The report creation unit 115 creates a report by use of the list created by the drug search unit 118.
The drug search unit 118 may search the drug database 124 having a data structure shown in
The drug search unit 118 searches the drug database 124 which stores data 124C shown in
The data 124A shown in
The report creation unit 115 creates a report (corresponding to step S111 in
On the basis of the gene panel information from the information selection unit 112, the report creation unit 115 may select the target to be put on the report, and may delete, from the report, information that has not been selected. Alternatively, the information selection unit 112 may control the report creation unit 115 such that information related to genes that correspond to the gene panel information inputted through the input unit 17 is selected as the target to be put on the report and information that has not been selected is deleted from the report.
Next, a specific example of the report created by the report creation unit 115 is described with reference to
In the upper left part of the example of the report shown in
Below these items, the gene panel name such as “Panel A” is also indicated as the gene panel information. Further, the quality evaluation index “QC index” obtained from the process using the quality control sample, the result of analysis thereof, and the like are also outputted in the report.
The report created by the report creation unit 115 may be transmitted in the form of data, from the output unit 13 to the communication terminal 5 (see
As shown in
For example, in a case where PhiX DNA has been used as the non-subject-derived nucleic acid in preparation of a measurement sample, the measurement sample is not a target for the sorting performed for each index sequence in step S10 shown in
For example, in a case where a library prepared from a subject-derived sample having been analyzed is used in preparation of a measurement sample, the library is not a target to be analyzed, and thus, the process of step S11 and the subsequent steps in
For example, in a case where a quality control sample has been used in preparation of a measurement sample, there is no need to identify mutation, and thus, the process of step S15 and the subsequent steps in
That is, the information processing apparatus 1 selectively performs the process of step S10 and the subsequent steps shown in
Here, a quality evaluation index for evaluating the quality of sequence information is described. Examples of the quality evaluation index include the following:
Index (i): quality evaluation index indicating the quality of reading of base information performed by the sequencer 2.
Index (ii): quality evaluation index indicating the proportion of bases read by the sequencer 2, to bases included in a plurality of genes to be analyzed.
Index (iii): quality evaluation index indicating the depth of sequence information.
Index (iv): quality evaluation index indicating variation in the depth of sequence information.
Index (v): quality evaluation index indicating whether or not all the mutations in standard genes included in the quality control sample have been detected.
Index (i) can include
index (i-1): quality score, and
index (i-2): cluster concentration.
The above-described quality evaluation indexes are described with reference to
Index (i-1): Quality Score
The quality score is an index indicating the accuracy of each base in the gene sequence read by the sequencer 2.
For example, when the sequence information is outputted as a FASTQ file from the sequencer 2, the quality score is also included in the sequence information (see
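As an illustration of how a per-base quality score in a FASTQ record might be handled, the following is a minimal sketch. It assumes the widely used Phred convention, in which a score Q corresponds to an error probability of 10^(-Q/10), and the Phred+33 ASCII encoding; the description above does not state which encoding the sequencer 2 uses. The threshold of 30 and all function names are hypothetical.

```python
def decode_quality_string(quality, offset=33):
    """Decode a FASTQ quality line into integer scores (Phred+33 is an assumption)."""
    return [ord(ch) - offset for ch in quality]


def phred_to_error_probability(q):
    """Convert a Phred score Q into the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def fraction_at_least(scores, threshold=30):
    """Fraction of bases whose quality score is at least the threshold (assumed value)."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Toy quality line: 'I' decodes to 40, '5' to 20, and '!' to 0 under Phred+33.
scores = decode_quality_string("IIII5555!!")
high_quality_fraction = fraction_at_least(scores, threshold=30)
```

A summary such as `high_quality_fraction` could then be compared against a predetermined criterion when the quality of reading is evaluated.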
Index (i-2): Cluster Concentration
The sequencer 2 locally amplifies and immobilizes a large number of single-stranded DNA fragments on a flow cell to form a cluster (see “9” in
For example, in a case where the cluster density becomes excessively high, and the clusters are excessively close to each other or overlap each other, the contrast of the taken image of the flow cell, i.e., the S/N ratio, is lowered, whereby focusing by the fluorescence microscope becomes difficult. Therefore, fluorescence cannot be accurately detected. As a result, the sequence cannot be accurately read.
Index (ii): Quality Evaluation Index Indicating the Proportion of Bases Read by the Sequencer 2, to Bases Included in a Plurality of Genes to be Analyzed.
This index indicates how many of the bases read by the sequencer 2 (including bases outside the target region) are bases in the target region. This index can be calculated as the ratio of the total number of bases read in the target region to the total number of bases having been read.
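The ratio described above can be sketched as follows. This is only an illustration under stated assumptions: reads and target regions are represented as hypothetical `(chromosome, start, end)` tuples with half-open coordinates, the target regions do not overlap each other, and the index is interpreted as read bases falling in the target region divided by all read bases.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Length of the overlap between two half-open intervals (0 if they are disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def on_target_ratio(reads, targets):
    """Ratio of read bases inside the target regions to all read bases.

    Assumes the target regions do not overlap each other; otherwise
    bases would be counted more than once.
    """
    total_bases = 0
    on_target_bases = 0
    for chrom, start, end in reads:
        total_bases += end - start
        for t_chrom, t_start, t_end in targets:
            if chrom == t_chrom:
                on_target_bases += overlap_length(start, end, t_start, t_end)
    return on_target_bases / total_bases if total_bases else 0.0


# Toy data: one target region and three 50-base reads.
targets = [("chr1", 100, 200)]
reads = [("chr1", 90, 140), ("chr1", 150, 200), ("chr2", 0, 50)]
ratio = on_target_ratio(reads, targets)  # 90 on-target bases out of 150 read bases
```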
Index (iii): Quality Evaluation Index Indicating the Depth of Sequence Information.
This index is an index based on the total number of pieces of the sequence information obtained by reading the bases included in a gene to be analyzed. This index can be calculated as a ratio between the total number of bases having depths greater than or equal to a predetermined value among the bases having been read, and the total number of bases having been read.
The depth means the total number of pieces of sequence information having been read for one base.
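The depth-based index above can be illustrated with a short sketch. It assumes that aligned reads are available as hypothetical `(start, end)` half-open intervals on a single reference, and it takes "bases having been read" to mean positions with a depth of at least 1; both the data layout and the threshold value are assumptions.

```python
from collections import Counter


def per_base_depth(reads, region_start, region_end):
    """Count, for each position in the region, how many reads cover it."""
    depth = Counter()
    for start, end in reads:
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos] += 1
    return [depth[pos] for pos in range(region_start, region_end)]


def depth_index(depths, min_depth):
    """Fraction of read bases (depth >= 1) whose depth is at least min_depth."""
    read_depths = [d for d in depths if d > 0]
    if not read_depths:
        return 0.0
    return sum(1 for d in read_depths if d >= min_depth) / len(read_depths)


# Toy data: three overlapping reads; positions 10 and 11 are never read.
reads = [(0, 6), (2, 8), (4, 10)]
depths = per_base_depth(reads, 0, 12)
index = depth_index(depths, min_depth=2)  # 6 of the 10 read bases have depth >= 2
```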
Index (iv): Quality Evaluation Index Indicating Variation in the Depth of Sequence Information.
This index indicates the uniformity of the depth. When the number of pieces of sequence information read in a certain portion of the region having been read is extremely great, the uniformity of the depth is low. When the sequence information is relatively uniform over the region having been read, the uniformity of the depth is high. The manner of expressing the uniformity of the depth is not particularly limited. For example, the uniformity can be expressed as a number by using the interquartile range (IQR) of the depth. The greater the IQR is, the lower the uniformity is. The smaller the IQR is, the higher the uniformity is.
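As a minimal sketch of expressing the uniformity with the IQR, the following uses Python's standard `statistics.quantiles` function to obtain the quartiles of a list of per-base depths. The depth values are made-up toy data, and the choice of quartile method (the library default) is an assumption, since the description does not specify one.

```python
import statistics


def depth_iqr(depths):
    """Interquartile range (Q3 - Q1) of per-base depths."""
    q1, _median, q3 = statistics.quantiles(depths, n=4)
    return q3 - q1


# Toy data: one evenly covered region and one with a few extreme positions.
uniform_depths = [100, 102, 98, 101, 99, 100, 103, 97]
skewed_depths = [100, 102, 98, 900, 99, 850, 103, 5]

uniform_iqr = depth_iqr(uniform_depths)
skewed_iqr = depth_iqr(skewed_depths)
# The region with extreme depths yields the larger IQR, i.e. lower uniformity.
```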
Index (v): Quality Evaluation Index Indicating Whether or not all the Mutations in Standard Genes Included in the Quality Control Sample have been Detected.
This index is an index indicating that the mutations in standard genes included in the quality control sample have been detected and accurately identified. For example, mutations (see the column of “Variant”) in standard genes included in the quality control sample A shown in
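A check of this kind can be sketched as a comparison between the mutations expected in the quality control sample and the mutations actually identified. The variant labels below are hypothetical placeholders and are not taken from the quality control sample A; the function names are likewise assumptions.

```python
def missing_mutations(expected, detected):
    """Return the expected mutations that were not detected, sorted for stable output."""
    return sorted(set(expected) - set(detected))


def all_mutations_detected(expected, detected):
    """True when every expected mutation appears among the detected ones."""
    return not missing_mutations(expected, detected)


# Hypothetical placeholder labels for mutations in the standard genes.
expected = {"GENE1:variant_a", "GENE2:variant_b", "GENE3:variant_c"}
detected_complete = {"GENE1:variant_a", "GENE2:variant_b", "GENE3:variant_c", "GENE4:variant_d"}
detected_partial = {"GENE1:variant_a", "GENE3:variant_c"}

passed = all_mutations_detected(expected, detected_complete)
missed = missing_mutations(expected, detected_partial)
```

Listing the missed mutations, rather than returning only a pass or fail flag, makes it easier to see which standard gene caused the index to fail.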
The information processing apparatus 1 is a computer that executes the instructions of a program, which is software that realizes each function. This computer includes one or more processors, for example, and also includes a computer-readable storage medium having the program stored therein. In the computer, the processor reads the program from the storage medium and executes the program, whereby the object of the present disclosure is achieved. As the processor, a CPU (Central Processing Unit) can be used, for example. As the storage medium, a “non-transitory and tangible medium”, such as a ROM (Read Only Memory), tape, disk, card, semiconductor memory, or programmable logic circuit, can be used. The computer may further include a RAM (Random Access Memory) or the like into which the program is loaded. The program may be supplied to the computer via any transmission medium (communication network, broadcast wave, or the like) that can transmit the program. One aspect of the present disclosure can also be realized in the form of a data signal which is realized by electronic transmission of the program and which is embedded in a carrier wave.
The present disclosure is not limited to the above-described embodiments, and various modifications can be made without departing from the scope of the claims. Embodiments obtained by combining as appropriate technological means disclosed in different embodiments are also included in the technological scope of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
2018-163954 | Aug 2018 | JP | national |