The present invention relates to an information processing program, an information processing method, and an information processing apparatus.
In recent years, genomes that make up the deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) of humans and other organisms have been analyzed to predict the impact of new viruses and to develop vaccines and the like. Furthermore, research has been conducted on detecting, on the basis of the genomes, mutations (point mutations) involved in diseases such as cancer and gene abnormalities such as gene mutation, and on prophylaxes and diagnoses of diseases.
Specifically, there has been known a technique of storing base sequences of the human genome in association with positions and providing differences between individuals as useful semantic information. For example, positional information of the base sequence is obtained in response to request information of a genome analysis service or the like, and base sequence information associated with the obtained positional information is returned in response.
Examples of the related art include [Patent Document 1] Japanese Laid-open Patent Publication No. 2012-234558; and [Patent Document 2] Japanese Laid-open Patent Publication No. 2012-157283.
According to an aspect of the embodiments, there is provided a non-transitory computer-readable storage medium storing an information processing program for causing a computer to perform processing including: obtaining a plurality of pieces of segmented genome data, which is genome information of a specific individual; generating a plurality of pieces of segmented codon data obtained by encoding each of the plurality of pieces of segmented genome data in a codon unit on the basis of a codon conversion table in which a codon and a code are associated with each other; identifying, on the basis of reference codon data obtained by encoding reference genome data to be a reference in the codon unit and each of the plurality of pieces of segmented codon data, a type and a position of an appearance of gene mutation different from the code that appears in the reference codon data among a plurality of the codes that appears in the plurality of pieces of segmented codon data; and generating a gene mutation inverted index in which the gene mutation and the type and position of the appearance of the gene mutation are associated with each other.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
However, a base sequence output from a sequencer is segmented into pieces of several hundred bytes (B) each. Moreover, the data size of the base sequence of the human genome is 3 gigabytes (GB), which is significantly large.
Conventionally, since the base sequence of the personal genome is obtained in a segmented state, the segmented base sequences are connected. While the Burrows-Wheeler (BW) transform, block sorting, or the like is often used as a technique for the connecting, the segmented parts must be searched for and connected, so the analysis time is significantly long. Therefore, the length of the base sequence analysis time and the data size after the connection are issues.
In one aspect, an object is to provide an information processing program, an information processing method, and an information processing apparatus capable of shortening a personal genome analysis time and reducing a data size.
Hereinafter, embodiments of an information processing program, an information processing method, and an information processing apparatus according to the present invention will be described in detail with reference to the drawings. Note that the present invention is not limited by those embodiments. Furthermore, the individual embodiments may be appropriately combined with each other unless otherwise contradicted.
First, a genome is genetic information, which is a base sequence of DNA or RNA. Next, codons, which are three bases, determine amino acids, and multiple amino acids make up protein. Moreover, multiple proteins bind to form a primary structure, a secondary structure, and a tertiary (higher-order) structure.
Meanwhile, there are four types of DNA or RNA bases, which are denoted by the symbols “A”, “G”, “C”, and “T” or “U”. Furthermore, a sequence of three bases is called a “codon”, and there are 64 kinds of them, which determine 20 kinds of amino acids. Each of the amino acids is denoted by a symbol from “A” to “Y”. Multiple types of codons are associated with one amino acid. For example, the amino acid “alanine (Ala)” is associated with the codons “GCU”, “GCC”, “GCA”, and “GCG”; that is, the amino acid remains the same even if the third base is different.
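The counts above can be checked with a short sketch. It enumerates all three-base combinations of the four RNA bases and confirms that the four alanine codons named above differ only in the third base. The names `BASES`, `CODONS`, `ALANINE_CODONS`, and `third_base_variants` are illustrative and do not appear in the embodiments.

```python
from itertools import product

BASES = "UCAG"  # the four RNA bases

# Every ordered triplet of bases is one codon: 4 * 4 * 4 = 64 kinds.
CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# The alanine codons listed in the text above.
ALANINE_CODONS = {"GCU", "GCC", "GCA", "GCG"}


def third_base_variants(codon):
    """Return the codons that share the first two bases with `codon`."""
    return {codon[:2] + base for base in BASES}
```

For instance, `third_base_variants("GCA")` yields exactly the four alanine codons, which is the third-base redundancy described above.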
As illustrated in
Then, the information processing apparatus 10 generates reference codon data “@Ek . . . ” obtained by encoding the reference genome data “UUU . . . ” in codon units using the codon conversion table. Furthermore, the information processing apparatus 10 generates a bitmap-type reference inverted index in which the codon code and the appearance position in the reference codon data are associated with each other.
In such a state, the information processing apparatus 10 obtains segmented genome data α to η from a sequencer that performs sequencing of the personal genome. Then, the information processing apparatus 10 refers to the codon conversion table to encode, in codon units, each of the segmented genome data α to η in the state of being segmented, thereby generating segmented codon data α to η.
Then, the information processing apparatus 10 sequentially extracts partial reference codon data from the reference codon data using the reference inverted index for each of the segmented codon data α to η. By sequentially comparing the segmented codon data with the partial reference codon data in codon units, a single-nucleotide polymorphism (hereafter referred to as gene mutation) indicating a subtle difference in genetic information between individuals is detected, and a bitmap-type SNPs inverted index (gene mutation inverted index) in which a type and position of mutation are associated with each other is generated.
At this time, the information processing apparatus 10 narrows down the codon sequences corresponding to the segmented codon data using the reference inverted index without connecting the segmented codon data α to η, and extracts the partial reference codon data, whereby the generation of the SNPs inverted index may be sped up. For example, the information processing apparatus 10 searches the reference inverted index of the reference genome for the longest-match string, thereby narrowing down the positions where the encoded data “@, E, k, F, O” of the codon sequence “UUU, UCC, AAG, UCA, UGG” to be searched for, which is specified in advance, appears.
Here, the information processing apparatus 10 compares the segmented codon data with the extracted partial reference codon data in codon units, and detects gene mutation of different codons. Then, the information processing apparatus 10 initializes the inverted index to “0”, and sets “1” only to bits corresponding to bases of the different codons and their positions, whereby an SNPs inverted index 20 may be generated without connecting all the segmented codon data.
In this manner, even in a case where the personal genome is segmented, the information processing apparatus 10 is enabled to analyze the gene mutation while it remains segmented, whereby the analysis time of the personal genome may be shortened.
The communication unit 11 is a processing unit that controls communication with another device, and is implemented by, for example, a communication interface or the like. For example, the communication unit 11 transmits/receives data to/from the sequencer, which is a providing source of the personal genome, and receives segmented genome data 13α to 13η segmented into pieces of several hundred bytes (B) each.
The storage unit 12 is a processing unit that stores various types of data, various programs to be executed by the control unit 30, and the like, and is implemented by, for example, a memory, a hard disk, or the like. This storage unit 12 stores segmented genome data 13, a codon conversion table 14, segmented codon data 15, reference genome data 16, reference codon data 17, a reference inverted index 18, partial reference codon data 19, and the SNPs inverted index 20.
The segmented genome data 13 is segmented base sequence data obtained by segmenting the personal genome to be analyzed into a predetermined size. For example, the segmented genome data 13 is data including the segmented genome data 13α “UUU . . . ” to the segmented genome data 13η “. . . C” generated from the personal genome “UUUUUCA . . . ”. This segmented genome data 13 is obtained by the control unit 30.
The codon conversion table 14 is information to be used at a time of encoding a base sequence, and stores codons and codes in association with each other. Specifically, the codon conversion table 14 is conversion information in which high-frequency codons with high appearance frequencies and codes assigned to the high-frequency codons are associated with each other.
The reference genome data 16 is base sequence data of the human genome to be a reference. For example, the Japanese reference genome is made public by Tohoku University Tohoku Medical Megabank Organization. Note that the reference genome data 16 may be stored in advance, or may be obtained from a server or the like designated by the control unit 30.
The reference codon data 17 is encoded data obtained by encoding the reference genome data 16 in codon units.
The reference inverted index 18 is a bitmap-type inverted index in which the codon code and the appearance position in the reference codon data 17 are associated with each other.
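A minimal sketch of such a bitmap-type inverted index follows, assuming that each bitmap is held as a Python integer whose bit i is “1” when the codon code appears at offset i of the codon data. The function names `build_inverted_index` and `offsets` are hypothetical, and the actual bitmap layout of the embodiment may differ.

```python
def build_inverted_index(codon_codes):
    """Map each codon code to a bitmap of the offsets where it appears.

    Bit i of the bitmap (an int) is 1 when the code occurs at offset i.
    """
    index = {}
    for offset, code in enumerate(codon_codes):
        index[code] = index.get(code, 0) | (1 << offset)
    return index


def offsets(bitmap):
    """List the offsets whose bit is set, in ascending order."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]
```

With the encoded data “@Ek@E”, the bitmap for “@” has bits 0 and 3 set, and the bitmap for “E” has bits 1 and 4 set.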
As illustrated in
The SNPs inverted index 20 is a bitmap-type inverted index of gene mutation for the personal genome. Specifically, the SNPs inverted index 20 is a bitmap-type inverted index in which each of the segmented codon data 15 is compared with the partial reference codon data 19 extracted from the reference codon data 17 and a type and position of different gene mutation are associated with each other. Note that the structure of the SNPs inverted index 20 is similar to that of the reference inverted index 18, and descriptions thereof will be omitted. For example, the SNPs inverted index 20 is provided with a bitmap for each type of predetermined SNPs such as the third base SNPs.
The control unit 30 is a processing unit that takes overall control of the information processing apparatus 10, and is, for example, a processor or the like. The control unit 30 includes an acquisition unit 31, an encoding unit 32, a generation unit 33, and an output unit 34. Note that the acquisition unit 31, the encoding unit 32, the generation unit 33, and the output unit 34 are implemented by an electronic circuit included in a processor, a process executed by the processor, or the like.
The acquisition unit 31 is a processing unit that obtains the segmented genome data 13. For example, the acquisition unit 31 obtains the segmented genome data 13 from a specified providing source, and stores it in the storage unit 12. Note that the acquisition unit 31 may receive the segmented genome data 13 transmitted from the providing source, or may obtain it periodically.
The encoding unit 32 is a processing unit that encodes the segmented genome data 13.
At this time, the encoding unit 32 assigns a codon code to a three-base sequence registered in the codon conversion table 14, and encodes it.
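The encoding in codon units may be sketched as follows. The five codon-to-code pairs are the ones that appear in this description (UUU→@, UCC→E, AAG→k, UCA→F, UGG→O); the full contents of the codon conversion table 14 are not reproduced here. Dropping trailing bases that do not fill a codon and raising an error for an unregistered codon are assumptions of this sketch, and `encode_codons` is a hypothetical name.

```python
# Partial codon conversion table, limited to pairs quoted in the description.
CODON_TABLE = {"UUU": "@", "UCC": "E", "AAG": "k", "UCA": "F", "UGG": "O"}


def encode_codons(bases, table=CODON_TABLE):
    """Encode a base sequence three bases at a time into codon codes.

    Assumption: the sequence starts on a codon boundary; trailing bases
    that do not form a full codon are ignored. A codon missing from the
    table raises KeyError.
    """
    usable = len(bases) - len(bases) % 3
    return "".join(table[bases[i:i + 3]] for i in range(0, usable, 3))
```

Each segmented piece may be passed to `encode_codons` independently, so no connection of the pieces is needed before encoding.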
The generation unit 33 is a processing unit that generates the SNPs inverted index 20. Specifically, in a case where the segmented genome data 13 of the personal genome of a certain individual is obtained, the generation unit 33 analyzes the segmented genome, and generates a bitmap-type SNPs inverted index 20 indicating gene mutation.
For example, the generation unit 33 sequentially extracts the partial reference codon data 19 from the reference codon data 17 using the reference inverted index 18 for each of the segmented codon data α to η, and sequentially compares it. Then, the generation unit 33 detects gene mutation included in each of the segmented codon data, sets “1” to a bit that associates a type and position of the gene mutation, generates the SNPs inverted index 20, and stores it in the storage unit 12.
Here, the generation unit 33 may speed up the generation of the SNPs inverted index 20 by extracting, using the reference inverted index 18, the partial reference codon data 19 corresponding to each of the segmented codon data α to η from the reference codon data 17. In view of the above, the extraction process and the generation of the SNPs inverted index 20 will be specifically described with reference to
As illustrated in
An example of performing narrowing down using the reference inverted index 18 in this manner will be described with reference to
Here, as an example, how the reference codon data 17 is narrowed down according to the codon sequence (4) “UUU(@), UCC(E), AAG(k), UCA(F)” using the reference inverted index 18 will be described with reference to
The generation unit 33 obtains the bitmap b_UUU (see 1-a in
In this manner, the left shifting and the AND operation are used to search for positions where “1” appears in succession. Specifically, the generation unit 33 shifts the bitmap b21 to the left to generate a bitmap b22. The generation unit 33 obtains the bitmap b_AAG, and performs an AND operation on the bitmap b_AAG and the bitmap b22, thereby generating a bitmap b23. Since “1” stands at the offsets “9” and “n+2” of the bitmap b23, it is found that the offsets 7 to 9 and n to n+2 include the codon “UUU(@), UCC(E), AAG(k)”.
The generation unit 33 shifts the bitmap b23 to the left to generate a bitmap b24. The generation unit 33 obtains the bitmap b_UCA, and performs an AND operation on the bitmap b_UCA and the bitmap b24, thereby generating a bitmap b25. Since “1” stands at the offsets “10” and “n+3” of the bitmap b25, it is found that the offsets 7 to 10 and n to n+3 include the codon “UUU(@), UCC(E), AAG(k), UCA(F)”.
Moreover, the generation unit 33 shifts the bitmap b25 to the left to generate a bitmap b26. A bitmap b_UGG corresponding to the codon UGG(O) is obtained for the codon sequence (5) “UUU(@), UCC(E), AAG(k), UCA(F), UGG(O)”. An AND operation is performed on the bitmap b_UGG and the bitmap b26 to generate a bitmap b27. Since “1” stands only at the offset “n+4” of the bitmap b27, it is found that the offsets n to n+4 include the codon sequence “UUU(@), UCC(E), AAG(k), UCA(F), UGG(O)” and that the multiple candidates have been narrowed down to one.
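The left shift and AND narrowing described above can be traced with a small sketch. Bitmaps are modeled as Python integers in which bit i corresponds to offset i, so shifting left by one moves every “1” to the next offset. The reference string below is made up for illustration: “q” is a filler code that is not part of the description, and the two occurrences of “@EkF” stand in for the offsets 7 and n (here n = 13). The names `build_index`, `narrow`, and `offsets` are hypothetical.

```python
def build_index(codes):
    """Map each codon code to a bitmap (int) of its offsets."""
    index = {}
    for offset, code in enumerate(codes):
        index[code] = index.get(code, 0) | (1 << offset)
    return index


def narrow(index, query):
    """Return the bitmap obtained after each step of the narrowing.

    After step k, a set bit marks the offset of the last codon of a
    position where the first k + 1 codes of `query` appear in succession.
    """
    current = index.get(query[0], 0)
    steps = [current]
    for code in query[1:]:
        # Shift left, then AND with the bitmap of the next codon code.
        current = (current << 1) & index.get(code, 0)
        steps.append(current)
    return steps


def offsets(bitmap):
    """List the offsets whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Illustrative reference codon data; "q" is a hypothetical filler code.
REFERENCE = "@E" + "qqqqq" + "@EkF" + "qq" + "@EkFO"
```

Calling `narrow(build_index(REFERENCE), "@EkFO")` leaves three candidates after “@”, two after “@Ek”, and a single candidate after “@EkFO”, which mirrors the progression from the bitmap b_UUU to the bitmap b27.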
In this manner, the generation unit 33 executes the process illustrated in
Next, the generation unit 33 compares the segmented codon data 15 of the personal genome with the partial reference codon data 19 extracted in
In this case, the generation unit 33 sets “1” to bit position 0 in advance in the bitmap (bitmap b_UUU) of the codon code “UUU(@)” of the reference inverted index 18.
Next, the SNPs inverted index 20 of the personal genome corresponding to the reference inverted index 18 will be described. As for the gene mutation type, U, C, A, G, and comprehensive bitmaps are provided for each of the third, second, and first bases according to the three bases of the codon. (The comprehensive bitmap may be omitted.) In general, gene mutation commonly occurs in the third base, and rarely occurs in the second base and the first base. Note that a dynamic dictionary storing bitmaps and detailed information associated with special gene mutation is also provided.
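One way to picture this structure is a mapping from a mutation type, taken here as the pair (base position within the codon, substituted base), to a bitmap of codon offsets. The sketch below compares unencoded codons for readability, whereas the embodiment compares codon codes; the comprehensive bitmaps and the dynamic dictionary are omitted. The name `build_snp_index` and the `start_offset` parameter are hypothetical.

```python
def build_snp_index(ref_codons, person_codons, start_offset=0):
    """Build {(base position 1..3, personal base): bitmap of codon offsets}.

    All bitmaps start at 0; a bit is set to 1 only where the personal
    codon differs from the reference codon at that base position.
    `start_offset` is the offset of the first codon within the reference.
    """
    index = {}
    for i, (ref, own) in enumerate(zip(ref_codons, person_codons)):
        if ref == own:
            continue  # identical codons carry no mutation
        for pos in range(3):
            if ref[pos] != own[pos]:
                key = (pos + 1, own[pos])
                index[key] = index.get(key, 0) | (1 << (start_offset + i))
    return index
```

For example, when the reference codon “UCC” is read as “UCU” in the personal genome, only the bitmap for the third base “U” receives a “1” at that codon offset.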
As illustrated in
That is, as illustrated in
Returning to
Thereafter, the acquisition unit 31 obtains each of the segmented genome data (S102), and the encoding unit 32 encodes each of the segmented genome data in codon units on the basis of the codon conversion table 14 to generate each of the segmented codon data 15 (S103).
Then, the generation unit 33 extracts, using the reference inverted index 18, the partial reference codon data 19 corresponding to the individual segmented codon data 15 in the state of being segmented (S104). Thereafter, the generation unit 33 compares the extracted partial reference codon data 19 with each of the segmented codon data 15 to identify a type and position of gene mutation (S105), and generates the SNPs inverted index 20 (S106).
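The flow of S102 to S106 may be sketched end to end as follows, under several simplifying assumptions: every segment starts on a codon boundary, the first `seed_len` codes of a segment contain no mutation and identify a unique position, and the result is recorded as a plain mapping from codon offset to (reference code, personal code) rather than as the bitmap-type SNPs inverted index 20. The code “e” for “UCU” and all function names are hypothetical.

```python
# Codes for UUU, UCC, AAG, UCA, UGG follow the description; "e" is made up.
TABLE = {"UUU": "@", "UCC": "E", "AAG": "k", "UCA": "F", "UGG": "O", "UCU": "e"}


def encode(bases):
    """Encode a base sequence in codon units (S103)."""
    usable = len(bases) - len(bases) % 3
    return "".join(TABLE[bases[i:i + 3]] for i in range(0, usable, 3))


def build_index(codes):
    """Bitmap-type inverted index: code -> int bitmap of offsets."""
    index = {}
    for offset, code in enumerate(codes):
        index[code] = index.get(code, 0) | (1 << offset)
    return index


def locate(index, codes, seed_len=3):
    """Find the start offset of a segment by shift/AND narrowing (S104).

    Returns None when the seed matches nowhere or in more than one place.
    """
    used = min(seed_len, len(codes))
    if used == 0:
        return None
    hits = index.get(codes[0], 0)
    for code in codes[1:used]:
        hits = (hits << 1) & index.get(code, 0)
    if hits == 0 or hits & (hits - 1):  # no candidate, or several
        return None
    return (hits.bit_length() - 1) - (used - 1)


def analyze(reference_bases, segments, seed_len=3):
    """Detect differing codons per segment without connecting segments."""
    ref_codes = encode(reference_bases)
    ref_index = build_index(ref_codes)
    mutations = {}
    for segment in segments:                      # S102
        codes = encode(segment)                   # S103
        start = locate(ref_index, codes, seed_len)  # S104
        if start is None:
            continue
        partial = ref_codes[start:start + len(codes)]
        for i, (ref, own) in enumerate(zip(partial, codes)):  # S105
            if ref != own:
                mutations[start + i] = (ref, own)  # S106, simplified
    return mutations
```

Each segment is located and compared on its own, so the order in which the segments arrive does not matter in this sketch.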
As described above, the information processing apparatus 10 compresses and encodes the base sequence of the reference genome in codon units, and generates a bitmap-type inverted index corresponding to the codon. Furthermore, the information processing apparatus 10 compresses and encodes the segmented base sequences of the personal genome in codon units, searches for the longest-match string using the inverted index of the reference genome, narrows down the area, and extracts a partial reference genome corresponding to each of the segmented base sequences. At the same time, the information processing apparatus 10 compares the partial reference genome with the segmented personal genome in codon units to generate the bitmap-type SNPs inverted index. Therefore, the information processing apparatus 10 is enabled to analyze the gene mutation and generate SNPs inverted index by codon encoding without connecting the segmented personal genome, whereby it becomes possible to shorten the analysis time of the personal genome and to reduce the data size.
Note that, with regard to the reference inverted index associated with the 64 types of codons and their positions, the narrowing down may be sped up by expanding the codons to N-grams, although the index size increases. For example, when expanded to 2-grams, the narrowing-down time is reduced to ½, although the number of index entries increases from 64 types to 4,096 (64×64) types. Furthermore, in a similar manner to a text inverted index, the SNPs inverted index may also be hashed with adjacent prime numbers. Since each of the SNPs may be compressed to a capacity of 6 to 8 bits, the SNPs inverted index per person is approximately several kilobytes (KB). Meanwhile, although the extraction of the partial reference codon data fails if SNPs are included near the top of the segmented genome data, it is sufficient to carry out the narrowing down again from the codon after the SNPs.
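The trade-off for N-grams can be put in numbers with a small sketch. The count of index entries follows directly from the 64 codon kinds; the count of bitmap lookups is a simplified cost model assumed here (one lookup per N-gram of the query), not a figure taken from the embodiments. The names `index_kinds` and `lookups_needed` are hypothetical.

```python
from math import ceil

CODON_KINDS = 64


def index_kinds(n):
    """Number of distinct N-grams of codons, i.e. bitmaps in the index."""
    return CODON_KINDS ** n


def lookups_needed(query_codons, n):
    """Bitmap lookups to narrow down a query, assuming one per N-gram."""
    return ceil(query_codons / n)
```

Under this model, a query of 8 codons needs 8 lookups with single codons and 4 lookups with 2-grams, while the index grows from 64 to 4,096 kinds.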
In a second embodiment, an example of being applied to canceration diagnosis at hospitals will be described.
In such a system configuration, the information processing apparatus 10 of each of the hospitals analyzes the personal genome of a patient to generate an electronic medical record, and analyzes a causal relationship with cancer. Then, the information processing apparatus 10 of each of the hospitals transmits the causal relationship to the information processing apparatus 10 of the integrated analysis center. With this arrangement, the information processing apparatus 10 of the integrated analysis center is enabled to collect the causal relationships analyzed in the individual hospitals.
Here, the analysis of the causal relationship in each of the hospitals will be described.
Specifically, the information processing apparatus 10 of each of the hospitals obtains the personal genome of each patient and uses the method according to the first embodiment, thereby generating a bitmap-type SNPs inverted index 20 corresponding to each patient. At this time, in a case where special gene mutation is detected during gene mutation analysis of segmented genome data 13 of each personal genome, the information processing apparatus 10 stores detailed information in a dynamic dictionary. Note that codon sequence storage in an encoding part may be omitted. Then, the information processing apparatus 10 performs an AND operation (logical product) on the SNPs inverted index 20 corresponding to each patient with a disease such as cancer, thereby extracting SNPs common to individual diseases and generating an SNPs inverted index representing the causal relationship with each disease.
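The AND operation over the patients' indexes may be sketched as follows, assuming that each SNPs inverted index is a mapping from a mutation type to an integer bitmap of codon offsets. The representation and the name `common_snps` are illustrative assumptions.

```python
from functools import reduce


def common_snps(patient_indexes):
    """AND the bitmaps of every patient, mutation type by mutation type.

    Only positions set to 1 for all patients survive; mutation types with
    no common position are dropped from the result.
    """
    if not patient_indexes:
        return {}
    shared_keys = set.intersection(*(set(ix) for ix in patient_indexes))
    result = {}
    for key in shared_keys:
        bitmap = reduce(lambda a, b: a & b, (ix[key] for ix in patient_indexes))
        if bitmap:
            result[key] = bitmap
    return result
```

Given three patients with the same disease, only the SNPs present in all three indexes remain, which corresponds to the SNPs common to that disease.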
For example,
Furthermore, the example of
Then, the information processing apparatus 10 of each of the hospitals transmits, to the integrated analysis center, the SNPs inverted index corresponding to each cancer as a causal relationship indicating the analysis result. For example, as illustrated in
In this manner, by using the method according to the second embodiment, it becomes possible to link electronic medical records and genomes between the integrated analysis center and the hospitals to analyze the causal relationship between cancer and SNPs using the SNPs inverted index, which may be used for medical treatment such as a prophylaxis and analysis of cancer. Furthermore, SNPs of personal information included in the genome may be protected by multi-layered encryption with multiple different passwords.
In a third embodiment, an example in which an integrated analysis center collects causal relationships of canceration from individual hospitals and comprehensively analyzes each canceration will be described.
In such a system configuration, the information processing apparatus 10 of the integrated analysis center collects, from each of the hospitals, data associated with individual causal relationships corresponding to diseases, such as cancer, using the method described in the second embodiment, for example. Then, the information processing apparatus 10 of the integrated analysis center decodes the collected data, and analyzes integrated causal relationships common among the individual hospitals.
Here, the integrated analysis of the causal relationships in the integrated analysis center will be described.
Specifically, the integrated analysis center collects causal relationship analysis results from each of the hospitals, and decodes them, thereby obtaining an SNPs inverted index corresponding to each disease, such as cancer. Then, the integrated analysis center performs, for each cancer, an AND operation (logical product) on the SNPs inverted index obtained from each of the hospitals, thereby extracting SNPs common to individual cancers and generating an inverted index for each cancer.
For example,
Furthermore,
As a result, the integrated analysis center is enabled to further analyze the causal relationship between cancer and SNPs using the AND operation on the basis of data received from each of the hospitals. Furthermore, the integrated analysis center may deliver the integrated analysis result of the causal relationship between cancer and SNPs to each of the hospitals. At this time, the integrated analysis center delivers the integrated analysis result (SNPs inverted index) corresponding to each disease, such as cancer, to each of the hospitals using the transmission method described in the second embodiment.
In a fourth embodiment, an example of performing canceration diagnosis at each hospital using an integrated analysis result generated in the third embodiment will be described.
In such a system configuration, the integrated analysis center generates an integrated analysis result (SNPs inverted index) of causal relationships between cancer and SNPs using, for example, the method described in the third embodiment. Then, the integrated analysis center delivers the integrated analysis result to each of the hospitals using the method described in the second embodiment. Thereafter, each of the hospitals decodes the delivered integrated analysis result, and uses it to perform canceration diagnosis.
Here, the canceration diagnosis at each hospital will be described.
As illustrated in
In the example of
In this manner, by using the method according to the fourth embodiment, it becomes possible to achieve prophylaxes and diagnoses of diseases, such as canceration, at each hospital. Furthermore, since the prophylaxes and diagnoses may be performed using the integrated SNPs inverted index built from the causal relationships collected from each of the hospitals, it becomes possible to achieve resource-saving high-speed prophylaxes and diagnoses with high statistical accuracy, which may be used for early detection of cancer and the like. Note that the integrated analysis result for each cancer type generated by the integrated analysis center is an exemplary statistical inverted index.
Although the embodiments of the present invention have been described above, the present invention may be implemented in various different modes in addition to the embodiments described above.
The numerical values, the number of bits, the codon codes, the number of the codon codes, the arrangement of codes, and the like used in the embodiments described above are merely examples, and may be changed in any way.
Pieces of information including a processing procedure, a control procedure, a specific name, various types of data, and parameters described above or illustrated in the drawings may be optionally changed unless otherwise specified. Note that the codon conversion table 14 is exemplary codon conversion information, the reference codon data 17 is exemplary reference encoded data, and the SNPs inverted index 20 is an exemplary gene mutation inverted index. The acquisition unit 31 is an exemplary acquisition unit, the encoding unit 32 is an exemplary generation unit that generates multiple segmented codon data, and the generation unit 33 is an exemplary generation unit that generates the gene mutation inverted index.
Furthermore, each component of each device illustrated in the drawings is functionally conceptual, and is not necessarily physically configured as illustrated in the drawings. In other words, specific forms of distribution and integration of the individual devices are not limited to those illustrated in the drawings. That is, all or a part of them may be configured by being functionally or physically distributed or integrated in optional units depending on various loads, use situations, or the like.
Moreover, all or any part of individual processing functions performed in each device may be implemented by a central processing unit (CPU) and a program analyzed and executed by the CPU, or may be implemented as hardware by wired logic.
Next, an exemplary hardware configuration of the information processing apparatus 10 will be described.
The communication device 10a is a network interface card or the like, and communicates with another server. The HDD 10b stores programs and DBs for activating the functions illustrated in
The processor 10d reads a program that executes processing similar to the processing of each processing unit illustrated in
In this manner, the information processing apparatus 10 operates as an information processing apparatus that executes an information processing method by reading and executing a program. Furthermore, the information processing apparatus 10 may implement functions similar to those in the embodiments described above by reading the program described above from a recording medium with a medium reading device and executing the read program described above. Note that other programs referred to in the embodiments are not limited to being executed by the information processing apparatus 10. For example, the present invention may be similarly applied also to a case where another computer or server executes the program, or a case where such a computer and server cooperatively execute the program.
This program may be distributed via a network such as the Internet. Furthermore, this program may be recorded in a computer-readable recording medium such as a hard disk, a flexible disk (FD), a compact disc read only memory (CD-ROM), a magneto-optical disk (MO), or a digital versatile disc (DVD), and may be executed by being read from the recording medium by a computer.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
This application is a continuation application of International Application PCT/JP2020/026730 filed on Jul. 8, 2020 and designated the U.S., the entire contents of which are incorporated herein by reference.
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/JP2020/026730 | Jul 2020 | US
Child | 18149768 | | US