The present disclosure provides a novel method, system, and related technology for processing sequence information on a single biological unit. More specifically, the present disclosure provides a system for automatically constructing and providing microorganism genomic data.
While construction of microorganism genomic data is advancing, current data is often based on metagenomic information, which is insufficient in both quality and quantity for analyzing complex bacterial flora.
Although some genetic information (genomic information, etc.) can be obtained for each single biological unit, no approach has been provided for processing that information with sufficient quality.
As a result of diligent studies, the inventors have completed a system for accumulating sequence information on a single biological unit at a single biological unit level and automatically constructing and providing highly accurate microorganism genomic data therefrom.
Examples of embodiments of the present disclosure include the following.
A method of processing sequence information on a single biological unit, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(B) optionally, a step of adding, to the cluster, partial sequence information on the single biological units corresponding to the cluster in a database; and
(C) a step of creating a sequence information draft for the single biological units by using the partial sequence information of sequence information on the single biological units and sequence information on the single biological units in the database.
The method of item 1, further comprising utilizing a database if step (B) is performed.
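As a non-limiting illustration of step (A) of item 1, the following Python sketch groups single biological units whose organism lineage identification sequences are nearly identical and pools their partial sequence information per cluster. The `SingleUnit` record, the `identity` function (a position-by-position comparison without alignment), and the threshold values are assumptions made for this example and are not specified in the present disclosure; a practical implementation would likely rely on a sequence alignment tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SingleUnit:
    unit_id: str
    marker: str  # lineage identification sequence (hypothetical field)
    contigs: List[str] = field(default_factory=list)  # partial sequence information


def identity(a: str, b: str) -> float:
    """Fraction of identical positions, without alignment (illustration only)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def cluster_by_marker(units: List[SingleUnit], threshold: float = 0.97) -> List[List[SingleUnit]]:
    """Greedy clustering: a unit joins the first cluster whose representative
    (its first member) has a marker identity at or above the threshold."""
    clusters: List[List[SingleUnit]] = []
    for unit in units:
        for cluster in clusters:
            if identity(unit.marker, cluster[0].marker) >= threshold:
                cluster.append(unit)
                break
        else:
            clusters.append([unit])
    return clusters


units = [
    SingleUnit("cell1", "ACGTACGTAC", ["AAAC"]),
    SingleUnit("cell2", "ACGTACGTAC", ["CCGT"]),
    SingleUnit("cell3", "TTTTTTTTTT", ["GGGA"]),
]
clusters = cluster_by_marker(units, threshold=0.9)
# Pool the partial sequence information of each cluster as input for draft creation.
pooled = [[c for u in cluster for c in u.contigs] for cluster in clusters]
```

The pooled lists stand in for the input to the draft creation of step (C); the assembly itself is outside the scope of this sketch.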
A method of processing sequence information on a single biological unit, the method comprising:
(A) a step of extracting genes without duplication from a draft in a database;
(B) a step of calculating the number or a ratio of corresponding drafts for each of the genes; and
(C) a step of selecting a gene with the number or ratio of the corresponding drafts greater than or equal to a predetermined value as a candidate of an organism lineage identification sequence.
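The screening described in the preceding item can be sketched as follows, under one possible reading in which "extracting genes without duplication" means counting each gene at most once per draft. The function name `select_marker_candidates`, the input layout (a mapping from draft identifier to annotated gene names), and the `min_ratio` default are hypothetical and chosen only for illustration.

```python
from collections import Counter


def select_marker_candidates(drafts, min_ratio=0.9):
    """drafts: mapping of draft id -> iterable of gene names found in that draft."""
    counts = Counter()
    for genes in drafts.values():
        counts.update(set(genes))  # each gene is counted once per draft
    total = len(drafts)
    # Keep genes present in at least min_ratio of all drafts.
    return sorted(g for g, c in counts.items() if total and c / total >= min_ratio)


drafts = {
    "draft1": ["rpoB", "gyrB", "16S"],
    "draft2": ["rpoB", "16S", "16S"],
    "draft3": ["rpoB", "recA"],
}
candidates = select_marker_candidates(drafts, min_ratio=0.66)
```

In this toy input, `rpoB` occurs in all three drafts and `16S` in two of three, so both pass a 0.66 ratio, while `gyrB` and `recA` do not.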
A method of processing sequence information on a single biological unit, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; and
(B) a step of comparing partial sequence information corresponding to the cluster in a database with partial sequence information of the cluster, calculating a degree of similarity for each partial sequence, and identifying a partial sequence with a degree of similarity greater than or equal to a predetermined degree of similarity as an organism lineage identification sequence.
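A minimal sketch of step (B) of the preceding item follows, assuming that partial sequences in the cluster and in the database are keyed by a shared name so that corresponding sequences can be paired. The degree of similarity here is the ratio reported by Python's `difflib.SequenceMatcher`, which is a stand-in chosen for brevity and is not the measure used in the present disclosure; the function names and the 0.95 threshold are likewise assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; autojunk is disabled so long sequences are not filtered."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def identify_marker_sequences(cluster_seqs, db_seqs, min_similarity=0.95):
    """Return {name: similarity} for sequences at or above the threshold."""
    hits = {}
    for name, seq in cluster_seqs.items():
        ref = db_seqs.get(name)
        if ref is None:
            continue  # no corresponding sequence in the database
        score = similarity(seq, ref)
        if score >= min_similarity:
            hits[name] = score
    return hits


cluster_seqs = {"geneA": "ACGTACGTACGT", "geneB": "AAAAAAAAAAAA", "geneC": "ACGT"}
db_seqs = {"geneA": "ACGTACGTACGT", "geneB": "CCCCCCCCCCCC"}
hits = identify_marker_sequences(cluster_seqs, db_seqs)
```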
A method of processing sequence information on a single biological unit, the method comprising:
(D) a step of ranking partial sequence information of sequence information on a plurality of single biological units from a highest quality based on a predetermined judgment criterion;
(E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information; and
(E′) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and selecting a draft created up to this point based on a predetermined judgment criterion.
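Steps (D), (E), and (E′) can be illustrated with the following toy sketch. Each single biological unit is modeled as a dictionary with a `quality` value, a set of `covered` loci, and a `contamination` count; a "draft" is simply the union of covered loci, and the draft score penalizes contamination. These data structures, the `build_draft` stand-in for co-assembly, and the scoring rule are all assumptions introduced for this example, not the judgment criteria of the present disclosure.

```python
def build_draft(units):
    """Stand-in for co-assembly: the union of loci covered by the selected units."""
    covered = set()
    for unit in units:
        covered |= unit["covered"]
    return covered


def score_draft(draft, units, penalty=2.0):
    """Hypothetical criterion: reward coverage, penalize contamination."""
    return len(draft) - penalty * sum(u["contamination"] for u in units)


def best_draft(units, sizes=(2, 3, 4)):
    ranked = sorted(units, key=lambda u: u["quality"], reverse=True)  # step (D)
    best = None
    for n in sizes:  # steps (E) and (E′): try populations of different sizes
        population = ranked[:n]
        draft = build_draft(population)
        score = score_draft(draft, population)
        if best is None or score > best[0]:
            best = (score, n, draft)
    return best


units = [
    {"quality": 0.9, "covered": {1, 2, 3, 4}, "contamination": 0},
    {"quality": 0.8, "covered": {3, 4, 5, 6}, "contamination": 0},
    {"quality": 0.5, "covered": {6, 7}, "contamination": 1},
    {"quality": 0.2, "covered": {1}, "contamination": 2},
]
score, size, draft = best_draft(units)
```

With this input the two highest-ranked units give the best score, because adding lower-ranked units raises contamination faster than it adds coverage.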
The method of processing sequence information on a single biological unit of item 4, the method comprising:
(F) a step of comparing the selected draft with partial sequence information of sequence information on a single biological unit that was not selected in (E) or (E′) and selecting partial sequence information of sequence information on the single biological unit having a sequence of a portion that is not included in the draft;
(G) a step of creating a longer draft by using the sequence information selected in (F) and the selected draft;
(G′) optionally, a step of repeating (G) until the longer draft reaches a full length of sequence information; and
(G″) optionally, a step of repeating the steps of item 4 based on a judgment criterion with a lower criterion in the entire partial sequence information constituting the draft.
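Steps (F), (G), and (G′) can be sketched as an iterative extension loop. As a simplification, the draft and each unselected unit are modeled as sets of loci, and the unit contributing the most loci absent from the draft is merged first; this greedy ordering, the `extend_draft` name, and the `full_length` parameter (the number of loci in a complete sequence) are assumptions for illustration only.

```python
def extend_draft(draft, unselected, full_length):
    """Repeatedly merge the unit that adds the most loci missing from the draft."""
    draft = set(draft)
    remaining = list(unselected)
    while len(draft) < full_length:
        # Step (F): find the unit with the largest portion not included in the draft.
        best = max(remaining, key=lambda u: len(u - draft), default=None)
        if best is None or not (best - draft):
            break  # nothing left that would lengthen the draft
        draft |= best  # step (G): create a longer draft
        remaining.remove(best)
    return draft


extended = extend_draft({1, 2, 3}, [{3, 4}, {1, 2}, {5, 6, 7}], full_length=7)
```

The loop stops either when the draft reaches the assumed full length, corresponding to step (G′), or when no remaining unit contributes a new portion.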
A method of processing sequence information on a single biological unit, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage;
(H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(I) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, registering the draft in a database as a new group.
The method of item 6, wherein the reclustering is performed through network analysis and community detection.
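The present disclosure does not name a specific community detection algorithm. As one hedged example, the sketch below applies a deterministic variant of label propagation to a weighted similarity network in which nodes are single biological units and edge weights are pairwise similarities. The function name, the edge representation, and the tie-breaking rule are assumptions made for this illustration.

```python
from collections import defaultdict


def label_propagation(edges, max_rounds=100):
    """edges: mapping of (node_u, node_v) -> similarity weight (undirected)."""
    neighbors = defaultdict(dict)
    for (u, v), weight in edges.items():
        neighbors[u][v] = weight
        neighbors[v][u] = weight
    labels = {node: node for node in neighbors}  # each node starts in its own community
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbors):
            totals = defaultdict(float)
            for other, weight in neighbors[node].items():
                totals[labels[other]] += weight
            # Heaviest neighboring label wins; ties go to the smallest label.
            new_label = min(totals, key=lambda lab: (-totals[lab], lab))
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return sorted(groups.values(), key=sorted)


edges = {
    ("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "c"): 1.0,  # tightly linked group
    ("d", "e"): 1.0, ("d", "f"): 1.0, ("e", "f"): 1.0,  # second tightly linked group
    ("c", "d"): 0.1,                                    # weak link between the groups
}
communities = label_propagation(edges)
```

Each resulting community is a candidate reclustered group whose draft would then be compared with the draft of the original cluster in step (H′). Units without any edge do not appear in the network and would need separate handling.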
A method of processing sequence information on a single biological unit, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(D) a step of ranking partial sequence information of sequence information on a plurality of single biological units belonging to the cluster of the same lineage from a highest quality based on a predetermined judgment criterion;
(E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information;
(E″) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological unit from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and evaluating a draft created up to this point based on a predetermined judgment criterion;
(H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage, if evaluation of the draft does not change due to an increase in the number in a population of a set of sequence information;
(H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(J) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, repeating (D) to (E″) for partial sequence information of sequence information on the plurality of single biological units belonging to the reclustered cluster.
A program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(B) optionally, a step of adding, to the cluster, partial sequence information on the single biological units corresponding to the cluster in a database; and
(C) a step of creating a sequence information draft for the single biological units by using the partial sequence information of sequence information on the single biological units and sequence information on the single biological units in the database.
The program of item 9, further comprising utilizing a database if step (B) is performed.
A program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of extracting genes without duplication from a draft in a database;
(B) a step of calculating the number or a ratio of corresponding drafts for each of the genes; and
(C) a step of selecting a gene with the number or ratio of corresponding drafts greater than or equal to a predetermined value as a candidate of an organism lineage identification sequence.
A program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; and
(B) a step of comparing partial sequence information corresponding to the cluster in a database with partial sequence information of the cluster, calculating a degree of similarity for each partial sequence, and identifying a partial sequence with a degree of similarity greater than or equal to a predetermined degree of similarity as an organism lineage identification sequence.
A program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(D) a step of ranking partial sequence information of sequence information on a plurality of single biological units from a highest quality based on a predetermined judgment criterion;
(E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information; and
(E′) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and selecting a draft created up to this point based on a predetermined judgment criterion.
The program for implementing a method of processing sequence information on a single biological unit on a computer of item 12, the method comprising:
(F) a step of comparing the selected draft with partial sequence information of sequence information on the single biological unit that was not selected in (E) or (E′) and selecting partial sequence information of sequence information on the single biological unit having a sequence of a portion that is not included in the draft;
(G) a step of creating a longer draft by using the sequence information selected in (F) and the selected draft;
(G′) optionally, a step of repeating (G) until the longer draft reaches a full length of sequence information; and
(G″) optionally, a step of repeating the steps of item 12 based on a judgment criterion with a lower criterion in the entire partial sequence information constituting the draft.
A program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage;
(H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(I) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, registering the draft in a database as a new group.
The program of item 14, wherein the reclustering is performed through network analysis and community detection.
A program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(D) a step of ranking partial sequence information of sequence information on a plurality of single biological units belonging to the cluster of the same lineage from a highest quality based on a predetermined judgment criterion;
(E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information;
(E″) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological unit from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and evaluating a draft created up to this point based on a predetermined judgment criterion;
(H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage, if evaluation of the draft does not change due to an increase in the number in a population of a set of sequence information;
(H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(J) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, repeating (D) to (E″) for partial sequence information of sequence information on the plurality of single biological units belonging to the reclustered cluster.
A recording medium storing a program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(B) optionally, a step of adding, to the cluster, partial sequence information on the single biological units corresponding to the cluster in a database; and
(C) a step of creating a sequence information draft for the single biological units by using the partial sequence information of sequence information on the single biological units and sequence information on the single biological units in the database.
The recording medium of item 17, further comprising utilizing a database if step (B) is performed.
A recording medium storing a program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of extracting genes without duplication from a draft in a database;
(B) a step of calculating the number or a ratio of corresponding drafts for each of the genes; and
(C) a step of selecting a gene with the number or ratio of corresponding drafts greater than or equal to a predetermined value as a candidate of an organism lineage identification sequence.
A recording medium storing a program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; and
(B) a step of comparing partial sequence information corresponding to the cluster in a database with partial sequence information of the cluster, calculating a degree of similarity for each partial sequence, and identifying a partial sequence with a degree of similarity greater than or equal to a predetermined degree of similarity as an organism lineage identification sequence.
A recording medium storing a program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(D) a step of ranking partial sequence information of sequence information on a plurality of single biological units from a highest quality based on a predetermined judgment criterion;
(E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information; and
(E′) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and selecting a draft created up to this point based on a predetermined judgment criterion.
The recording medium storing a program for implementing a method of processing sequence information on a single biological unit on a computer of item 20, the method comprising:
(F) a step of comparing the selected draft with partial sequence information of sequence information on a single biological unit that was not selected in (E) or (E′) and selecting partial sequence information of sequence information on the single biological unit having a sequence of a portion that is not included in the draft;
(G) a step of creating a longer draft by using the sequence information selected in (F) and the selected draft;
(G′) optionally, a step of repeating (G) until the longer draft reaches a full length of sequence information; and
(G″) optionally, a step of repeating the steps of item 20 based on a judgment criterion with a lower criterion in the entire partial sequence information constituting the draft.
A recording medium storing a program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage;
(H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(I) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, registering the draft in a database as a new group.
The recording medium of item 22, wherein the reclustering is performed through network analysis and community detection.
A recording medium storing a program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(D) a step of ranking partial sequence information of sequence information on the plurality of single biological units belonging to the cluster of the same lineage from a highest quality based on a predetermined judgment criterion;
(E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information;
(E″) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological unit from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and evaluating a draft created up to this point based on a predetermined judgment criterion;
(H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage, if evaluation of the draft does not change due to an increase in the number in a population of a set of sequence information;
(H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(J) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, repeating (D) to (E″) for partial sequence information of sequence information on the plurality of single biological units belonging to the reclustered cluster.
A system for processing sequence information on a single biological unit, the system comprising:
(A) a clustering unit for clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(B) optionally, an additional information addition unit for adding, to the cluster, partial sequence information on the single biological units corresponding to the cluster in a database; and
(C) a draft creation unit for creating a sequence information draft for the single biological units by using the partial sequence information of sequence information on the single biological units and sequence information on the single biological units in the database.
The system of item 25, further comprising a database utilization unit for utilizing a database if the system comprises the additional information addition unit of (B).
A system for processing sequence information on a single biological unit, the system comprising:
(A) an extraction unit for extracting genes without duplication from a draft in a database;
(B) a calculation unit for calculating the number or a ratio of corresponding drafts for each of the genes; and
(C) a selection unit for selecting a gene with the number or ratio of corresponding drafts greater than or equal to a predetermined value as a candidate of an organism lineage identification sequence.
A system for processing sequence information on a single biological unit, the system comprising:
(A) a clustering unit for clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; and
(B) an identification unit for comparing partial sequence information corresponding to the cluster in a database with partial sequence information of the cluster, calculating a degree of similarity for each partial sequence, and identifying a partial sequence with a degree of similarity greater than or equal to a predetermined degree of similarity as an organism lineage identification sequence.
A system for processing sequence information on a single biological unit, the system comprising:
(D) a ranking unit for ranking partial sequence information of sequence information on a plurality of single biological units from a highest quality based on a predetermined judgment criterion;
(E) a draft construction unit for selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information; and
(E′) a selection unit for selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and selecting a draft created up to this point based on a predetermined judgment criterion.
The system for processing sequence information on a single biological unit of item 28, the system comprising:
(F) a selection unit for comparing the selected draft with partial sequence information of sequence information on a single biological unit that was not selected in (E) or (E′) and selecting partial sequence information of sequence information on the single biological unit having a sequence of a portion that is not included in the draft;
(G) a draft improvement unit for creating a longer draft by using the sequence information selected in (F) and the selected draft;
(G′) optionally, a draft construction unit for repeating draft creation in (G) until the longer draft reaches a full length of sequence information; and
(G″) optionally, means for repeating the ranking, draft construction, and selection in (D), (E), and (E′) of item 28 based on a judgment criterion with a lower criterion in the entire partial sequence information constituting the draft.
A system for processing sequence information on a single biological unit, the system comprising:
(A) a clustering unit for clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(H) a reclustering unit for evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage;
(H′) a comparison unit for comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(I) a registration unit for determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, registering the draft in a database as a new group.
The system of item 30, wherein the reclustering unit performs reclustering through network analysis and community detection.
A system for processing sequence information on a single biological unit, the system comprising:
(A) a clustering unit for clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(D) a ranking unit for ranking partial sequence information of sequence information on the plurality of single biological units belonging to the cluster of the same lineage from a highest quality based on a predetermined judgment criterion;
(E) a draft construction unit for selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information,
(E″) selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and evaluating a draft created up to this point based on a predetermined judgment criterion;
(H) a reclustering unit for evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage, if evaluation of the draft does not change due to an increase in the number in a population of a set of sequence information;
(H′) a comparison unit for comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(J) means for determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, repeating (D) to (E″) for partial sequence information of sequence information on the plurality of single biological units belonging to the reclustered cluster.
A method of giving an instruction to a computer to execute processing of sequence information on a single biological unit, the computer given the instruction executing:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; and
(C) a step of creating a sequence information draft for the single biological unit by using the partial sequence information of sequence information on the single biological unit and sequence information on the single biological unit in a database created independently from the clustering.
The method of the preceding item, further comprising: (B) a step of adding, to the cluster, partial sequence information on the single biological units corresponding to the cluster in the database.
The method of any one of the preceding items, wherein (C) comprises removing a certain amount of partial sequence information comprising a sequence site found to have a large number of duplications to correct a bias in sequence reads.
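The bias correction of the preceding item can be illustrated by capping the number of reads retained per region. The sketch assumes, purely for illustration, that each read already carries a start position on some reference coordinate so that heavily duplicated sites can be recognized by position; the `cap_depth` name, the bin size, and the cap are hypothetical. Reference-free approaches, such as normalization based on k-mer abundance, are an alternative not shown here.

```python
from collections import defaultdict


def cap_depth(reads, bin_size=100, max_per_bin=3):
    """reads: list of (read_id, start_position). Keep at most max_per_bin reads
    per positional bin, discarding the excess from over-represented sites."""
    kept = []
    seen = defaultdict(int)
    for read_id, start in reads:
        bin_index = start // bin_size
        if seen[bin_index] < max_per_bin:
            kept.append((read_id, start))
            seen[bin_index] += 1
    return kept


reads = [("r1", 5), ("r2", 10), ("r3", 20), ("r4", 30), ("r5", 150)]
kept = cap_depth(reads)
```

Here the first bin holds four reads, so one is removed, while the sparsely covered second bin is left untouched.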
A method of giving an instruction to a computer to execute screening of candidates of an organism lineage identification sequence, the computer given the instruction executing:
A) a step of extracting genes without duplication from a draft in a database;
B) a step of calculating the number or a ratio of single copy genes for each of the genes; and
C) a step of selecting a gene with the number or ratio of single copy genes greater than or equal to a predetermined value as a candidate of an organism lineage identification sequence.
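Steps A) to C) above can be pictured with the following Python sketch. It assumes, purely for illustration, that each draft in the database is available as a list of annotated gene names (a name repeated once per copy) and that the "ratio of single copy genes" for a gene is the fraction of drafts in which that gene occurs exactly once. The function name, the input layout, and the 0.9 cutoff are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def screen_marker_candidates(drafts: Dict[str, List[str]], min_ratio: float = 0.9) -> List[str]:
    """Return genes that are single copy in at least `min_ratio` of the drafts."""
    if not drafts:
        return []
    single_copy_hits: Counter = Counter()
    for genes in drafts.values():
        # A) keep only genes that occur without duplication in this draft
        for gene, copies in Counter(genes).items():
            if copies == 1:
                single_copy_hits[gene] += 1
    total = len(drafts)
    # B) ratio of drafts in which the gene is single copy; C) threshold on it
    return sorted(g for g, hits in single_copy_hits.items() if hits / total >= min_ratio)
```

A count-based variant would compare `hits` with a predetermined number instead of dividing by `total`.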
A method of giving an instruction to a computer to execute processing of sequence information on a single biological unit, the computer given the instruction executing:
(D) a step of ranking partial sequence information of sequence information on a plurality of single biological units from a highest quality based on a predetermined judgment criterion;
(E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information; and
(E′) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and selecting a draft created up to this point based on a predetermined judgment criterion.
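The ranking and population-size search of steps (D), (E), and (E′) can be sketched as follows. This is a toy model under loudly stated assumptions: each single biological unit is reduced to the set of genome bins it covers plus a set of contaminated bins, "constructing a draft" is modeled as a set union rather than a real co-assembly, and the judgment criterion (covered bins minus a contamination penalty) is hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional, Tuple

Draft = Tuple[FrozenSet[int], FrozenSet[int]]  # (covered bins, contaminated bins)


@dataclass(frozen=True)
class UnitData:
    unit_id: str
    covered: FrozenSet[int]
    contaminated: FrozenSet[int] = frozenset()


def quality(unit: UnitData) -> int:
    """Assumed per-unit judgment criterion used for ranking in step (D)."""
    return len(unit.covered) - len(unit.contaminated)


def build_draft(units: Iterable[UnitData]) -> Draft:
    """Toy stand-in for co-assembly: union of what the selected units cover."""
    units = list(units)
    covered = frozenset().union(*(u.covered for u in units))
    contaminated = frozenset().union(*(u.contaminated for u in units))
    return covered, contaminated


def score_draft(draft: Draft, penalty: int = 2) -> int:
    """Assumed draft-level judgment criterion used in step (E′)."""
    covered, contaminated = draft
    return len(covered) - penalty * len(contaminated)


def select_best_draft(units: List[UnitData], sizes=(1, 2, 3, 4)) -> Optional[Tuple[int, int, Draft]]:
    """(D) rank, (E) build a draft from the top-k units, (E′) vary k and keep the best."""
    ranked = sorted(units, key=quality, reverse=True)
    best = None
    for k in sizes:
        if k > len(ranked):
            break
        draft = build_draft(ranked[:k])
        score = score_draft(draft)
        if best is None or score > best[0]:
            best = (score, k, draft)
    return best
```

The sketch shows why a larger population is not always better: adding a low-ranked, contaminated unit can lower the evaluation of the resulting draft, so the smaller population is retained.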
The method of giving an instruction to a computer to execute processing of sequence information on a single biological unit of any one of the preceding items, the computer given the instruction executing:
(F) a step of comparing the selected draft with partial sequence information of sequence information on a single biological unit that was not selected in (E) or (E′) and selecting partial sequence information of sequence information on the single biological unit having a sequence of a portion that is not included in the draft;
(G) a step of creating a longer draft by using the sequence information selected in (F) and the selected draft;
(G′) optionally, a step of repeating (G) until the longer draft reaches a full length of sequence information; and
(G″) optionally, a step of repeating the steps of item 5 based on a judgment criterion with a lower criterion in the entire partial sequence information constituting the draft.
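Steps (F) to (G′) describe an iterative extension of the selected draft. A minimal sketch, again assuming the toy representation in which a draft and each unselected unit are sets of covered genome bins (the names `extend_draft`, `unused`, and `genome_size` are hypothetical), could look like this:

```python
from typing import Dict, List, Set, Tuple


def extend_draft(draft: Set[int], unused: Dict[str, Set[int]], genome_size: int) -> Tuple[Set[int], List[str]]:
    """Repeatedly add the unselected unit contributing the most bins that the
    draft lacks, until the draft is full length or nothing new can be added."""
    draft = set(draft)          # work on copies so the inputs are not modified
    remaining = dict(unused)
    added: List[str] = []
    while len(draft) < genome_size and remaining:
        # (F) compare the draft with each unused unit and pick the most novel one
        best_id = max(remaining, key=lambda uid: len(remaining[uid] - draft))
        novel = remaining.pop(best_id) - draft
        if not novel:
            break               # no unused unit covers anything outside the draft
        # (G) create a longer draft from the selected data and the current draft
        draft |= novel
        added.append(best_id)
    return draft, added
```

Choosing the most novel unit first is one possible greedy policy; the items above only require that the selected unit contain a portion not included in the draft.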
A method of giving an instruction to a computer to execute processing of sequence information on a single biological unit, the computer given the instruction executing:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage;
(H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(I) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, registering the draft in a database as a new group.
The method of any one of the preceding items, wherein the reclustering is performed through network analysis and community detection.
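To give a rough idea of reclustering through network analysis, the sketch below builds a graph whose nodes are single biological units and whose edges link pairs with a similarity at or above a threshold, then splits the cluster into the connected components of that graph. Connected components are used here only as the simplest stand-in for a community detection algorithm (a modularity-based method could be substituted); the similarity scores, the threshold, and the function name are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def recluster(similarity: Dict[Tuple[str, str], float], threshold: float) -> List[List[str]]:
    """Split one lineage cluster into subclusters of mutually reachable units."""
    adjacency: Dict[str, Set[str]] = {}
    for (a, b), score in similarity.items():
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if score >= threshold:          # keep only sufficiently similar pairs as edges
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen: Set[str] = set()
    groups: List[List[str]] = []
    for start in sorted(adjacency):     # sorted for a deterministic result
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:                    # depth-first traversal of one component
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(sorted(group))
    return groups
```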
A method of giving an instruction to a computer to execute processing of sequence information on a single biological unit, the computer given the instruction executing:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(D) a step of ranking partial sequence information of sequence information on the plurality of single biological units belonging to the cluster of the same lineage from a highest quality based on a predetermined judgment criterion;
(E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information;
(E″) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and evaluating a draft created up to this point based on a predetermined judgment criterion;
(H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage, if evaluation of the draft does not change due to an increase in the number in a population of a set of sequence information;
(H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(J) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, repeating (D) to (E′) for partial sequence information of sequence information on the plurality of single biological units belonging to the reclustered cluster.
The method of any one of the preceding items, wherein the partial sequence information is determined by long-read sequencing.
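The control flow connecting steps (E″), (H), (H′), and (J) can be summarized in a short sketch. It assumes that each draft evaluation is a single number, that "does not change" means the last two evaluations differ by no more than a tolerance, and that reclustering is judged appropriate only when every subcluster draft evaluates better than the parent draft. These criteria and all names are illustrative assumptions, not the predetermined judgment criteria of the present disclosure.

```python
from typing import List


def has_plateaued(scores: List[float], tolerance: float = 0.0) -> bool:
    """True when the latest increase in population size left the evaluation unchanged."""
    return len(scores) >= 2 and abs(scores[-1] - scores[-2]) <= tolerance


def reclustering_is_appropriate(parent_score: float, subcluster_scores: List[float],
                                min_gain: float = 0.0) -> bool:
    """(H′)/(J), assumed rule: accept only if every subcluster draft beats the parent."""
    return bool(subcluster_scores) and all(s - parent_score > min_gain for s in subcluster_scores)


def next_action(scores: List[float], parent_score: float, subcluster_scores: List[float]) -> str:
    """Decide what to do after evaluating the most recent draft."""
    if not has_plateaued(scores):
        return "grow_population"        # keep enlarging the population
    if reclustering_is_appropriate(parent_score, subcluster_scores):
        return "repeat_on_subclusters"  # rerun ranking and draft construction per subcluster
    return "keep_parent_draft"
```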
A program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; and
(C) a step of creating a sequence information draft for the single biological units by using the partial sequence information of sequence information on the single biological units and sequence information on the single biological units in a database created independently from the clustering.
The program of the preceding item, further comprising: (B) a step of adding, to the cluster, partial sequence information on the single biological units corresponding to the cluster in a database.
The program of any one of the preceding items, wherein (C) comprises removing a certain amount of partial sequence information comprising a sequence site found to have a large number of duplications to correct a bias in sequence reads.
A program for implementing a method of screening candidates of an organism lineage identification sequence on a computer, the method comprising:
A) a step of extracting genes without duplication from a draft in a database;
B) a step of calculating the number or a ratio of single copy genes for each of the genes; and
C) a step of selecting a gene with the number or ratio of single copy genes greater than or equal to a predetermined value as a candidate of an organism lineage identification sequence.
A program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(D) a step of ranking partial sequence information of sequence information on a plurality of single biological units from a highest quality based on a predetermined judgment criterion;
(E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information; and
(E′) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and selecting a draft created up to this point based on a predetermined judgment criterion.
The program for implementing a method of processing sequence information on a single biological unit on a computer of any one of the preceding items, the method comprising:
(F) a step of comparing the selected draft with partial sequence information of sequence information on a single biological unit that was not selected in (E) or (E′) and selecting partial sequence information of sequence information on the single biological unit having a sequence of a portion that is not included in the draft;
(G) a step of creating a longer draft by using the sequence information selected in (F) and the selected draft;
(G′) optionally, a step of repeating (G) until the longer draft reaches a full length of sequence information; and
(G″) optionally, a step of repeating the steps of item 15 based on a judgment criterion with a lower criterion in the entire partial sequence information constituting the draft.
A program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage;
(H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(I) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, registering the draft in a database as a new group.
The program of any one of the preceding items, wherein the reclustering is performed through network analysis and community detection.
A program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(D) a step of ranking partial sequence information of sequence information on the plurality of single biological units belonging to the cluster of the same lineage from a highest quality based on a predetermined judgment criterion;
(E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information;
(E″) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and evaluating a draft created up to this point based on a predetermined judgment criterion;
(H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage, if evaluation of the draft does not change due to an increase in the number in a population of a set of sequence information;
(H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(J) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, repeating (D) to (E′) for partial sequence information of sequence information on the plurality of single biological units belonging to the reclustered cluster.
The program of any one of the preceding items, wherein the partial sequence information is determined by long-read sequencing.
A recording medium storing a program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; and
(C) a step of creating a sequence information draft for the single biological units by using the partial sequence information of sequence information on the single biological units and sequence information on the single biological units in a database created independently from the clustering.
The recording medium of the preceding item, further comprising:
(B) a step of adding, to the cluster, partial sequence information on the single biological units corresponding to the cluster in the database.
The recording medium of any one of the preceding items, wherein (C) comprises removing a certain amount of partial sequence information comprising a sequence site found to have a large number of duplications to correct a bias in sequence reads.
A recording medium storing a program for implementing a method of screening candidates of an organism lineage identification sequence on a computer, the method comprising:
A) a step of extracting genes without duplication from a draft in a database;
B) a step of calculating the number or a ratio of single copy genes for each of the genes; and
C) a step of selecting a gene with the number or ratio of single copy genes greater than or equal to a predetermined value as a candidate of an organism lineage identification sequence.
A recording medium storing a program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(D) a step of ranking partial sequence information of sequence information on a plurality of single biological units from a highest quality based on a predetermined judgment criterion;
(E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information; and
(E′) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and selecting a draft created up to this point based on a predetermined judgment criterion.
The recording medium storing a program for implementing a method of processing sequence information on a single biological unit on a computer of any one of the preceding items, the method comprising:
(F) a step of comparing the selected draft with partial sequence information of sequence information on a single biological unit that was not selected in (E) or (E′) and selecting partial sequence information of sequence information on the single biological unit having a sequence of a portion that is not included in the draft;
(G) a step of creating a longer draft by using the sequence information selected in (F) and the selected draft;
(G′) optionally, a step of repeating (G) until the longer draft reaches a full length of sequence information; and
(G″) optionally, a step of repeating the steps of item 25 based on a judgment criterion with a lower criterion in the entire partial sequence information constituting the draft.
A recording medium storing a program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage;
(H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(I) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, registering the draft in a database as a new group.
The recording medium of any one of the preceding items, wherein the reclustering is performed through network analysis and community detection.
A recording medium storing a program for implementing a method of processing sequence information on a single biological unit on a computer, the method comprising:
(A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(D) a step of ranking partial sequence information of sequence information on the plurality of single biological units belonging to the cluster of the same lineage from a highest quality based on a predetermined judgment criterion;
(E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information;
(E″) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and evaluating a draft created up to this point based on a predetermined judgment criterion;
(H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage, if evaluation of the draft does not change due to an increase in the number in a population of a set of sequence information;
(H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(J) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, repeating (D) to (E′) for partial sequence information of sequence information on the plurality of single biological units belonging to the reclustered cluster.
The recording medium of any one of the preceding items, wherein the partial sequence information is determined by long-read sequencing.
A system for processing sequence information on a single biological unit, the system comprising:
(A) a clustering unit for clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; and
(C) a draft creation unit for creating a sequence information draft for the single biological units by using the partial sequence information of sequence information on the single biological units and sequence information on the single biological units in a database created independently from clustering by the clustering unit of (A).
The system of the preceding item, further comprising:
(B) an additional information addition unit for adding, to the cluster, partial sequence information on the single biological units corresponding to the cluster in the database.
The system of any one of the preceding items, wherein
(C) comprises removing a certain amount of partial sequence information comprising a sequence site found to have a large number of duplications to correct a bias in sequence reads.
A system for screening candidates of an organism lineage identification sequence, the system comprising:
A) an extraction unit for extracting genes without duplication from a draft in a database;
B) a calculation unit for calculating the number or a ratio of single copy genes for each of the genes; and
C) a selection unit for selecting a gene with the number or ratio of single copy genes greater than or equal to a predetermined value as a candidate of an organism lineage identification sequence.
A system for processing sequence information on a single biological unit, the system comprising:
(D) a ranking unit for ranking partial sequence information of sequence information on a plurality of single biological units from a highest quality based on a predetermined judgment criterion;
(E) a draft construction unit for selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information; and
(E′) a selection unit for selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and selecting a draft created up to this point based on a predetermined judgment criterion.
The system for processing sequence information on a single biological unit of any one of the preceding items, the system comprising:
(F) a selection unit for comparing the selected draft with partial sequence information of sequence information on a single biological unit that was not selected in (E) or (E′) and selecting partial sequence information of sequence information on the single biological unit having a sequence of a portion that is not included in the draft;
(G) a draft improvement unit for creating a longer draft by using the sequence information selected in (F) and the selected draft;
(G′) optionally, a draft construction unit for repeating draft creation in (G) until the longer draft reaches a full length of sequence information; and
(G″) optionally, means for repeating the ranking, draft construction, and selection of (D), (E), and (E′) of item 35 based on a judgment criterion with a lower criterion in the entire partial sequence information constituting the draft.
A system for processing sequence information on a single biological unit, the system comprising:
(A) a clustering unit for clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(H) a reclustering unit for evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage;
(H′) a comparison unit for comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(I) a registration unit for determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, registering the draft in a database as a new group.
The system of any one of the preceding items, wherein the reclustering unit performs reclustering through network analysis and community detection.
A system for processing sequence information on a single biological unit, the system comprising:
(A) a clustering unit for clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence;
(D) a ranking unit for ranking partial sequence information of sequence information on the plurality of single biological units belonging to the cluster of the same lineage from a highest quality based on a predetermined judgment criterion;
(E) a draft construction unit for selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length than the partial sequence information from the partial sequence information,
(E″) selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and evaluating a draft created up to this point based on a predetermined judgment criterion;
(H) a reclustering unit for evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage, if evaluation of the draft does not change due to an increase in the number in a population of a set of sequence information;
(H′) a comparison unit for comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and
(J) means for determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion, and, if appropriate, repeating (D) to (E′) for partial sequence information of sequence information on the plurality of single biological units belonging to the reclustered cluster.
The system of any one of the preceding items, wherein the partial sequence information is determined by long-read sequencing.
A data structure containing partial sequence information of sequence information on a plurality of single biological units clustered for each of the same lineages based on an organism lineage identification sequence.
The data structure of any one of the preceding items, wherein the partial sequence information contained in the data structure is derived from two or more independently clustered and created databases.
The data structure of any one of the preceding items, wherein information associated with the independently performed clustering is linked to and stored with the partial sequence information.
The data structure of any one of the preceding items, wherein the partial sequence information, as a whole, constitutes genomic information.
The data structure of any one of the preceding items, wherein the partial sequence information is collected for each single biological unit.
The data structure of any one of the preceding items, wherein the partial sequence information is linked to and stored with identification information (ID information) on a single biological unit from which the partial sequence information is derived.
A data structure for a single biological unit obtained by integrating a plurality of data structures, containing partial sequence information of sequence information on a plurality of single biological units clustered for each of the same lineages based on an organism lineage identification sequence.
The data structure of item B7, further comprising one or more features of any one or more of the preceding items.
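One way to picture the data structure items above is the following Python sketch, in which each piece of partial sequence information is stored together with the ID of the single biological unit it is derived from and with information on the independently performed clustering, and several such structures are integrated per lineage. The class names, field names, and the `integrate` helper are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PartialSequence:
    unit_id: str         # ID information on the single biological unit of origin
    sequence: str        # the partial sequence itself
    source_db: str       # database in which the clustering was performed
    source_cluster: str  # cluster label assigned in that database


@dataclass
class LineageCluster:
    lineage: str
    records: List[PartialSequence] = field(default_factory=list)

    def by_unit(self) -> Dict[str, List[str]]:
        """Collect the partial sequences for each single biological unit."""
        grouped: Dict[str, List[str]] = {}
        for record in self.records:
            grouped.setdefault(record.unit_id, []).append(record.sequence)
        return grouped


def integrate(clusters: List[LineageCluster]) -> Dict[str, LineageCluster]:
    """Merge clusters of the same lineage coming from independently built databases."""
    merged: Dict[str, LineageCluster] = {}
    for cluster in clusters:
        target = merged.setdefault(cluster.lineage, LineageCluster(cluster.lineage))
        target.records.extend(cluster.records)
    return merged
```

Because every record keeps its `source_db` and `source_cluster`, the provenance of each partial sequence remains traceable after integration.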
The present disclosure is intended so that one or more of the features described above can be provided not only as the explicitly disclosed combinations, but also as other combinations thereof. Additional embodiments and advantages of the present disclosure are recognized by those skilled in the art by reading and understanding the following detailed description as needed.
With the present disclosure, sequence information on a single biological unit at a single biological unit level can be provided more accurately. Use of the present disclosure enables elucidation of a nearly complete genome sequence of microorganisms that cannot be cultured and analysis of genetic heterogeneity between microorganisms of the same strain.
The present disclosure is described hereinafter while showing the best mode thereof. Throughout the entire specification, a singular expression should be understood as encompassing the concept thereof in the plural form, unless specifically noted otherwise. Thus, singular articles (e.g., “a”, “an”, “the”, and the like in the case of English) should also be understood as encompassing the concept thereof in the plural form, unless specifically noted otherwise. The terms used herein should also be understood as being used in the meaning that is commonly used in the art, unless specifically noted otherwise. Thus, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which the disclosure pertains. In case of a contradiction, the present specification (including the definitions) takes precedence.
The definitions of the terms and/or basic technical matters especially used herein are described hereinafter when appropriate.
As used herein, “single biological unit” refers to a unit with genetic information or other information on a biomolecule. A single biological unit can include cells, cell-like constructs, and the like, but is not limited thereto. A single biological unit can also include artificially produced cells (so-called artificial cells), digital cells (provided as information), and the like.
As used herein, “cell” refers to any particle that encapsulates a molecule with genetic information and can be replicated (regardless of whether the cell can be replicated independently). As used herein, “cell” includes cells of unicellular organisms, bacteria, cells derived from a multicellular organism, fungi, and the like.
As used herein, “cell-like construct” refers to any particle that encapsulates a molecule with genetic information. As used herein, “cell-like construct” includes intracellular organelles such as mitochondria, cell nucleus, and chloroplast, viruses, and the like.
As used herein, “genetic information or other information on a biomolecule” refers to information specifying a biomolecule or an analog thereof. Genetic information or other information on a biomolecule can include structural information on a nucleic acid, amino acid, lipid, or sugar chain or an analog thereof, but is not limited thereto. Such information can also include information on diversity of interaction of a biomolecule or analog thereof such as a metabolite. “Genetic information” and “nucleic acid information” are used synonymously herein.
As used herein, “biomolecule” refers to a molecule of any organism or virus. A biomolecule can include a nucleic acid, protein, sugar chain, lipid, and the like. As used herein, “analog of a biomolecule” refers to a naturally-occurring or non-naturally-occurring variant of a biomolecule. An analog of a biomolecule can include a modified nucleic acid, modified amino acid, modified lipid, modified sugar chain, and the like.
As used herein, “population” refers to a collection including two or more single biological units, cells, or cell-like constructs.
As used herein, “subpopulation”, when used together with “population”, refers to a portion of a population with a smaller number of single biological units, cells, or cell-like constructs than the population.
As used herein, “gel” refers to a colloidal solution (sol) wherein a polymeric substance or colloidal particles form a mesh structure as a whole due to the interaction thereof, without fluidity while containing a large quantity of a liquid phase that is a solvent or dispersion medium. As used herein, “gelation” refers to changing a solution into a state of “gel”.
As used herein, “capsule” refers to anything with a shape that can retain a cell or cell-like construct therein. As used herein, “gel capsule” refers to a gel-like microparticulate construct that can retain a cell or cell-like construct therein.
As used herein, “genetic analysis” refers to studying the state of a nucleic acid (DNA, RNA, or the like) in a biological sample. In one embodiment, genetic analysis includes those that utilize a nucleic acid amplification reaction. Examples of genetic analysis include, in addition thereto, sequencing, genotyping/polymorphism analysis (SNP analysis, copy number variation, restriction fragment length polymorphism, repeat number polymorphism), expression analysis, Quenching Probe (Q-Probe), SYBR green method, melt curve analysis, real-time PCR, quantitative RT-PCR, digital PCR, and the like.
As used herein, “single biological unit level” refers to processing genetic information or other information on a biomolecule contained in one single biological unit and genetic information or other information on a biomolecule contained in other single biological units in a distinguishable manner.
As used herein, “single cell level” refers to processing of genetic information or other information on a biomolecule contained in one cell or cell-like construct distinctly from genetic information or other information on a biomolecule contained in other cells or cell-like constructs. For example, when a polynucleotide is amplified at a “single biological unit level” or “single cell level”, a polynucleotide in a single biological unit or a cell or cell-like construct and a polynucleotide in another single biological unit or cell or cell-like construct are each amplified in a distinguishable manner. In one embodiment of the present disclosure, a step of contacting said polynucleotide with an amplification reagent to amplify the polynucleotide within a gel capsule can also perform the amplification while maintaining the polynucleotide in a gel state within the gel capsule.
As used herein, “single biological unit analysis” refers to analysis of genetic information or other information on a biomolecule contained in one single biological unit (e.g., cell or cell-like construct) distinctly from genetic information or other information on a biomolecule contained in other single biological units (e.g., cells or cell-like constructs).
As used herein, “single cell analysis” refers to analysis of genetic information or other information on a biomolecule contained in one cell or cell-like construct distinctly from genetic information or other information on a biomolecule contained in other cells or cell-like constructs.
As used herein, “genetic information” refers to information on a nucleic acid encoding a gene or other information contained in one cell or cell-like construct, including the presence/absence of a specific genetic sequence, yield of a specific gene, and total nucleic acid yield.
As used herein, “information on a biomolecule” refers to information on a biomolecule contained in one cell or cell-like construct (including nucleic acid as well as protein, sugar, lipid, and the like) or an analog thereof, including the presence/absence of a structure or sequence of a specific biomolecule, identity of a structure or sequence, yield of a specific biomolecule, and total biomolecule yield.
As used herein, “nucleic acid information” refers to information on a nucleic acid contained in one cell or cell-like construct, including the presence/absence of a specific genetic sequence, yield of a specific gene, and total nucleic acid yield.
As used herein, “identity” refers to similarity in structures or sequences between two biomolecules. If a sequence is targeted, identity can be determined by comparing positions in each sequence that can be aligned for comparison.
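As a minimal sketch of the comparison described above, the following computes percent identity over the aligned positions of two sequences. The function name `percent_identity`, the use of `-` as a gap character, and the choice to skip gapped columns are illustrative assumptions and are not definitions given in the present disclosure.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap.

    Both inputs are assumed to be already aligned, i.e. of equal length,
    with '-' marking a gap (an assumption made for this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gapped columns cannot be compared position by position
        compared += 1
        if a == b:
            matches += 1
    # With no comparable column, identity is reported as 0.0 by convention here.
    return 100.0 * matches / compared if compared else 0.0
```

Other conventions (for example, counting gapped columns as mismatches) are equally possible and would give different values for the same pair.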
As used herein, “long-read sequencing” is a method of sequencing the entire sequence using a long read (a nucleotide chain that has been fragmented for analysis). In general, long-read sequencing performs decoding using a read with a length of 400 bases or longer.
Preferred embodiments are described hereinafter. It is understood that the embodiments are exemplification of the present disclosure, and the scope of the present disclosure is not limited to such preferred embodiments. It is also understood that those skilled in the art can make appropriate modifications or changes within the scope of the invention by referring to the following preferred embodiments. Those skilled in the art can appropriately combine one or more of any of the embodiments.
(Sequence Information Processing)
In one aspect, the present disclosure provides a method of processing sequence information on a single biological unit (e.g., cell or cell-like construct). The method comprises: (A) a step of clustering partial sequence information of sequence information of a plurality of single biological units (e.g., collection of genomes, transcriptomes, proteomes, equivalent genes, or the like) for each of the same lineages based on an organism lineage identification sequence (e.g., 16S rDNA or a marker gene); (B) optionally, a step of adding, to the cluster, partial sequence information on the single biological units corresponding to the cluster in a database; and (C) a step of creating a sequence information draft for the single biological units by using the partial sequence information of sequence information on the single biological units and sequence information on the single biological units in the database.
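Step (A) can be illustrated with a minimal sketch, assuming that each piece of partial sequence information carries an already aligned, equal-length lineage identification sequence (e.g., a 16S rDNA fragment). The names `Sag`, `marker_identity`, and `cluster_by_marker`, the greedy assignment to the first matching cluster, and the default threshold are hypothetical choices for illustration; the present disclosure does not prescribe a particular clustering algorithm.

```python
from dataclasses import dataclass


@dataclass
class Sag:
    sag_id: str   # identifier of the single biological unit of origin
    marker: str   # aligned lineage identification sequence (assumed equal length)


def marker_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length marker sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("markers must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_by_marker(sags, threshold=0.97):
    """Greedy clustering: join the first cluster whose representative is similar enough.

    The first member of each cluster acts as its representative.
    The threshold value is an illustrative assumption.
    """
    clusters = []
    for sag in sags:
        for cluster in clusters:
            if marker_identity(sag.marker, cluster[0].marker) >= threshold:
                cluster.append(sag)
                break
        else:
            # No existing cluster matched, so this SAG starts a new lineage cluster.
            clusters.append([sag])
    return clusters
```

Greedy assignment depends on input order; it is used here only because it keeps the sketch short.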
Step (B) is an optional step, which may or may not utilize a database. In this manner, a clustering method can be a method utilizing a database.
An organism lineage identification sequence (marker) can also be newly identified from a database after classification. In this aspect, the present disclosure provides a method of processing sequence information on a single biological unit (e.g., cell), the method comprising: (A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; and (B) a step of comparing partial sequence information corresponding to the cluster in a database with partial sequence information of the cluster, calculating a degree of similarity for each partial sequence, and identifying a partial sequence with a degree of similarity greater than or equal to a predetermined degree of similarity as an organism lineage identification sequence. In such a case, the organism lineage identification sequence can be used as a so-called biomarker.
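The comparison in step (B) of this aspect can be sketched as follows, assuming that partial sequences in the cluster and in the database are keyed by a shared name. The function name `identify_marker_candidates`, the dictionary layout, the default threshold, and the use of `difflib.SequenceMatcher.ratio()` as the degree of similarity are all illustrative assumptions; an alignment-based identity would normally be used in practice.

```python
from difflib import SequenceMatcher


def identify_marker_candidates(cluster_seqs, database_seqs, min_similarity=0.95):
    """Return {name: similarity} for partial sequences meeting the threshold.

    cluster_seqs / database_seqs: dicts mapping a partial-sequence name
    (hypothetical key, e.g. a gene name) to its sequence string.
    """
    candidates = {}
    for name, seq in cluster_seqs.items():
        reference = database_seqs.get(name)
        if reference is None:
            continue  # no counterpart in the database, so nothing to compare
        # ratio() is a generic string similarity in [0, 1]; it stands in for
        # a proper sequence identity in this sketch.
        similarity = SequenceMatcher(None, seq, reference, autojunk=False).ratio()
        if similarity >= min_similarity:
            candidates[name] = similarity
    return candidates
```

Sequences passing the threshold are the candidates that could then be registered as organism lineage identification sequences.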
In one aspect, the present disclosure is a method of processing sequence information on a single biological unit, the method comprising: (D) a step of ranking partial sequence information of sequence information on a plurality of single biological units from a highest quality based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage); (E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length (which can be a partial or full length) than the partial sequence information from the partial sequence information; and (E′) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and selecting a draft created up to this point based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage). It is preferable to repeat (E′) because it is preferable to repeat draft creation while changing the number of SAGs. In some embodiments, the aforementioned (D) to (E′) can be performed as a step for creating a sequence information draft of the single biological unit.
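Steps (D), (E), and (E′) can be sketched under a deliberately simplified toy model, in which each SAG is represented by the set of genome segments it covers and a "draft" is the union of those sets. This union is only a stand-in for a real co-assembly, and the names `Sag`, `rank_sags`, `build_draft`, and `select_best_draft`, as well as the rule "highest completion percentage among drafts within a contamination limit", are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sag:
    sag_id: str
    completeness: float               # percent, used for ranking in step (D)
    contamination: float              # percent, used for ranking in step (D)
    covered: frozenset                # toy model: genome segments covered by this SAG
    foreign: frozenset = frozenset()  # toy model: contaminant segments


def rank_sags(sags):
    """Step (D): highest completeness first; lower contamination breaks ties."""
    return sorted(sags, key=lambda s: (-s.completeness, s.contamination))


def build_draft(sags, genome_size):
    """Toy stand-in for co-assembly: returns (completeness %, contamination %)."""
    covered = frozenset().union(*(s.covered for s in sags))
    foreign = frozenset().union(*(s.foreign for s in sags))
    return 100.0 * len(covered) / genome_size, 100.0 * len(foreign) / genome_size


def select_best_draft(sags, sizes, genome_size, max_contamination):
    """Steps (E)/(E'): build drafts from the top-k SAGs for each k and keep the best.

    Returns (k, completeness, contamination), or None if every draft
    exceeds the contamination limit.
    """
    ranked = rank_sags(sags)
    best = None
    for k in sizes:
        completeness, contamination = build_draft(ranked[:k], genome_size)
        if contamination > max_contamination:
            continue  # draft rejected by the judgment criterion
        if best is None or completeness > best[1]:
            best = (k, completeness, contamination)
    return best
```

Repeating the loop over different values of `k` corresponds to repeating (E′) while changing the number of SAGs.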
In one preferred embodiment, the method of processing sequence information on a single biological unit of the present disclosure comprises: (F) a step of comparing the selected draft with partial sequence information of sequence information on a single biological unit that was not selected in (E) or (E′) and selecting partial sequence information of sequence information on the single biological unit having a sequence of a portion that is not included in the draft; (G) a step of creating a longer draft by using the sequence information selected in (F) and the selected draft; (G′) optionally, a step of repeating (G) preferably until the longer draft reaches a full length of sequence information; and (G″) optionally, a step of repeating steps (D), (E), and (E′) based on a judgment criterion with a lower criterion in the entire partial sequence information constituting the draft. For example, a looser parameter can be used as the judgment criterion with a lower criterion.
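Steps (F), (G), and (G′) can be sketched with a similar toy model, in which a draft and each unselected SAG are sets of covered genome segments. The function name `extend_draft` and the greedy rule of always adding the SAG that contributes the most missing segments are hypothetical; the present disclosure only requires selecting sequence information that has a portion not included in the draft.

```python
def extend_draft(draft, unselected, full_length):
    """Extend a draft with unselected SAGs that cover segments missing from it.

    draft:       set of segment indices already in the selected draft
    unselected:  dict mapping SAG id -> set of segment indices it covers
    full_length: number of segments in the full-length sequence (toy model)
    Returns (extended draft as a set, list of SAG ids used, in order).
    """
    draft = set(draft)
    remaining = dict(unselected)
    used = []
    # Step (G'): repeat until the draft is full length or nothing new can be added.
    while len(draft) < full_length and remaining:
        # Step (F): pick the SAG contributing the most segments absent from the draft.
        best_id = max(remaining, key=lambda sid: len(remaining[sid] - draft))
        gain = remaining.pop(best_id) - draft
        if not gain:
            break  # no remaining SAG covers a missing portion
        # Step (G): create the longer draft.
        draft |= gain
        used.append(best_id)
    return draft, used
```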
In one aspect, the partial sequence information is a single amplified genome (SAG). In a specific aspect, the present disclosure provides a method of refining a cluster in an aspect related to the stage immediately after determining that SAGs are of the “same” cluster (e.g., lineage or species). In this aspect, the present disclosure is a method of processing sequence information on a single biological unit, the method comprising: (A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; (H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage; (H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and (I) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage), and, if appropriate, registering the draft in a database as a new group.
In this regard, the evaluation described above can evaluate extracted partial sequence information (e.g., SAGs) with a marker gene in a round robin format, and the evaluation can use, for example, the distance between each SAG.
In a preferred embodiment, reclustering in the present disclosure is performed through network analysis and community detection.
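The round-robin evaluation and network-based reclustering can be sketched as follows. This sketch builds a network by linking SAGs whose marker distance is within a cutoff and then takes connected components as groups. Connected components are the simplest possible stand-in for community detection and are swapped in here only to keep the example self-contained; a modularity-based or similar community detection method would be more discriminating. The names `marker_distance` and `recluster` and the cutoff are illustrative assumptions.

```python
from itertools import combinations


def marker_distance(a: str, b: str) -> float:
    """Fraction of differing positions between two equal-length marker sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("markers must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def recluster(markers, max_distance):
    """Regroup SAGs within one lineage cluster.

    markers: dict mapping SAG id -> aligned marker sequence.
    Returns a list of groups, each a sorted list of SAG ids.
    """
    # Round-robin: evaluate every pair of SAGs and link the close ones.
    neighbors = {sid: set() for sid in markers}
    for a, b in combinations(sorted(markers), 2):
        if marker_distance(markers[a], markers[b]) <= max_distance:
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Connected components by depth-first search (stand-in for community detection).
    seen = set()
    groups = []
    for start in sorted(markers):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        group = []
        while stack:
            node = stack.pop()
            group.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(group))
    return groups
```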
The present disclosure also provides processing in an aspect of the stage after draft quality no longer improves even after increasing the number of pieces of partial sequence information (e.g., SAGs). In this aspect, the present disclosure is a method of processing sequence information on a single biological unit, the method comprising: (A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; (D) a step of ranking partial sequence information of sequence information on the plurality of single biological units belonging to the cluster of the same lineage from a highest quality based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage); (E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length (which can be a partial or full length) than the partial sequence information from the partial sequence information; (E″) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological unit from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and evaluating a draft created up to this point based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage); (H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage, if evaluation of the draft does not change (i.e., remains within a certain range) due to an increase in the number in a population of a set of sequence information; (H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and (J) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage), and, if appropriate, repeating (D) to (E′) for partial sequence information of sequence information on the plurality of single biological units belonging to the reclustered cluster.
It is understood that each of the steps in these methods can be appropriately combined in the present disclosure. When processing sequence information on a single biological unit and screening for candidates of an organism lineage identification sequence in some embodiments, the location from which an instruction to execute them is given to a computer can be different from the location where the instruction is received to actually perform this processing or the like. In another embodiment, each processing of the method of the present disclosure can be executed by a computer. In another embodiment, the database of the present disclosure can be a database created by the clustering or sequence analysis method of the present disclosure or a database created independently from the clustering or sequence analysis method of the present disclosure. In a preferred embodiment, a database created independently from the clustering or sequence analysis method of the present disclosure can be a database for data obtained by sequencing a sequence that is amplified based on single cell amplification. While it was understood in conventional art that addition of a sequence in another database would lead to reduced quality, it was found that the quality of sequences actually improves by adding a sequence in another database to a cluster.
In some embodiments where a draft genome is constructed from sequence data, a certain amount of partial sequence information comprising a sequence site found to have a large number of duplicate readings can be removed to correct (homogenize) a bias in sequence reads. Further improvement in genome quality is expected by repeated homogenization using a genome sequence created from homogenized sequence data as a reference sequence in response to clustering of sequence data that has been homogenized. If partial sequence information subjected to homogenization processing is read by long-read sequencing, even further improvement in genome quality is expected.
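The homogenization described above can be illustrated with a minimal in silico sketch, assuming reads have already been mapped to a reference and are represented as half-open `(start, end)` intervals. The function name `homogenize`, the interval representation, and the rule "drop a read if any position it covers has already reached the depth cap" are assumptions chosen for illustration and are not the specific procedure of the present disclosure.

```python
def homogenize(reads, reference_length, max_depth):
    """Cap per-position read depth by discarding reads over saturated sites.

    reads: iterable of (start, end) half-open intervals on the reference,
           with 0 <= start < end <= reference_length (assumed, checked below).
    Returns (kept reads, per-position depth after homogenization).
    """
    depth = [0] * reference_length
    kept = []
    for start, end in reads:
        if not (0 <= start < end <= reference_length):
            raise ValueError(f"read ({start}, {end}) lies outside the reference")
        # A read touching a site that already has max_depth reads is treated
        # as a redundant (duplicate-rich) read and removed.
        if max(depth[start:end]) >= max_depth:
            continue
        kept.append((start, end))
        for i in range(start, end):
            depth[i] += 1
    return kept, depth
```

In an iterative scheme, the kept reads would be reassembled, the resulting sequence used as the new reference, and the procedure repeated; that loop is omitted here.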
Constructing a draft genome of a sequence derived from a single biological unit presumes that the data itself is clean and has a certain degree of genome integrity, and that a plurality of pieces of single cell data are obtained together. This could not be materialized with conventional art, but was materialized for the first time by the present disclosure. Further, a draft genome of a sequence derived from a single biological unit had never been decoded by long-read sequencing. Since it was understood that a sequence derived from a single biological unit has a problem of producing a chimera (separate genome sequences that are not inherently connected are joined due to an error during amplification or the like to produce incorrectly decoded sequence data), a long-read assembly system that is suitable for single cell data with a chimera and high amplification bias had not been developed. Such a bias can be drastically reduced by referring to a plurality of single cell genomes and repeating mapping and assembly by utilizing the present disclosure. This allows an extremely accurate genome sequence to be obtained.
It is well known that a bias is generated in a sequence of an amplified DNA such as a genome sequence derived from a single cell. In this regard, homogenization processing (for reducing bias) in conventional methods designs enzymatic reactions or reaction conditions so that bias itself is not likely to occur upon amplification (Nishikawa et al. PLoS ONE), or uses a method of proactively degrading a DNA to reduce a bias generated after amplification or the like. However, a problem with these methods was that biases could not be completely removed. Since the present disclosure executes in silico processing even on data with a bias, data can be homogenized without the special designs in the reaction system described above. Since it is presumed that data itself is clean and is derived from a plurality of origins, this could only be executed by the method utilized in the present disclosure. For accuracy of a genome sequence, conventional methods perform mapping on a reference genome of related species or the like and evaluate a bias, gap, etc., to correct the sequence. Meanwhile, the method utilized in the present disclosure comprehensively analyzes a plurality of pieces of data for the same species, so that the data itself can serve as the reference for homogenization processing even without a reference genome of a related species. The method thus achieves a particularly significant effect compared to conventional art in that data for an unknown microorganism sample without a reference sequence can also be homogenized. Further, the method is extremely effective in decoding the complete genome of an unknown microorganism. The method can also decode an entire gene cluster without a gap and without culturing, even in cells wherein the position of the gene cluster in the genome has not been identified, so that the function thereof can be understood in detail.
Further, research and development that introduce the gene cluster into another organism that can be readily handled to create an intended substance is also possible. The following application examples/envisioned examples are expected.
(Program and Recording Medium)
In one aspect, the present disclosure provides a computer program for instructing a computer to implement a method of processing sequence information on a single biological unit (e.g., cell or cell-like construct) and a recording medium for storing the program (e.g., CD-R, flash memory, hard disk, transmission medium, cloud, or the like). The method implemented by the program comprises: (A) a step of clustering partial sequence information of sequence information on a plurality of single biological units (e.g., collection of genomes, transcriptomes, proteomes, equivalent genes, or the like) for each of the same lineages based on an organism lineage identification sequence (e.g., 16S rDNA or a marker gene); (B) optionally, a step of adding, to the cluster, partial sequence information on the single biological units corresponding to the cluster in a database; and (C) a step of creating a sequence information draft for the single biological units by using the partial sequence information of sequence information on the single biological units and sequence information on the single biological units in the database.
Step (B) is an optional step, which may or may not utilize a database. In this manner, a clustering method can be a method utilizing a database.
An organism lineage identification sequence (marker) can also be newly identified from a database after classification. In this aspect, the present disclosure provides a computer program for instructing a computer to implement a method of processing sequence information on a single biological unit (e.g., cell) and a recording medium for storing the program (e.g., CD-R, flash memory, hard disk, transmission medium, cloud, or the like). The method implemented by the program comprises: (A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; and (B) a step of comparing partial sequence information corresponding to the cluster in a database with partial sequence information of the cluster, calculating a degree of similarity for each partial sequence, and identifying a partial sequence with a degree of similarity greater than or equal to a predetermined degree of similarity as an organism lineage identification sequence. In such a case, an organism lineage identification sequence can be used as a so-called biomarker.
In one aspect, the present disclosure provides a computer program for instructing a computer to implement a method of processing sequence information on a single biological unit and a recording medium for storing the program (e.g., CD-R, flash memory, hard disk, transmission medium, cloud, or the like). The method implemented by the program comprises: (D) a step of ranking partial sequence information of sequence information on a plurality of single biological units from a highest quality based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage); (E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length (which can be a partial or full length) than the partial sequence information from the partial sequence information; and (E′) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological units from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and selecting a draft created up to this point based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage). It is preferable to repeat (E′) because it is preferable to repeat draft creation while changing the number of SAGs.
In a preferred embodiment, a method of processing sequence information on a single biological unit implemented by the program of the present disclosure comprises: (F) a step of comparing the selected draft with partial sequence information of sequence information on a single biological unit that was not selected in (E) or (E′) and selecting partial sequence information of sequence information on the single biological unit having a sequence of a portion that is not included in the draft; (G) a step of creating a longer draft by using the sequence information selected in (F) and the selected draft; (G′) optionally, a step of repeating (G) preferably until the longer draft reaches a full length of sequence information; and (G″) optionally, a step of repeating steps (D), (E), and (E′) based on a judgment criterion with a lower criterion in the entire partial sequence information constituting the draft. For example, a looser parameter can be used as the judgment criterion with a lower criterion.
In another aspect, the program of the present disclosure encodes a method of refining a cluster in an aspect related to the stage immediately after determining that SAG is of the “same” cluster (e.g., lineage or species). In this aspect, the present disclosure provides a computer program for instructing a computer to implement a method of processing sequence information on a single biological unit and a recording medium for storing the program (e.g., CD-R, flash memory, hard disk, transmission medium, cloud, or the like). The method implemented by the program comprises: (A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; (H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage; (H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and (I) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage), and, if appropriate, registering the draft in a database as a new group.
In this regard, the evaluation described above can evaluate extracted partial sequence information (e.g., SAGs) with a marker gene in a round robin format, and the evaluation can use, for example, the distance between each SAG. In a preferred embodiment, reclustering in the present disclosure is performed through network analysis and community detection.
The program of the present disclosure also provides processing in an aspect of the stage after draft quality no longer improves even after increasing the number of pieces of partial sequence information (e.g., SAGs). In this aspect, the present disclosure provides a computer program for instructing a computer to implement a method of processing sequence information on a single biological unit and a recording medium for storing the program (e.g., CD-R, flash memory, hard disk, transmission medium, cloud, or the like). The method implemented by the program comprises: (A) a step of clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; (D) a step of ranking partial sequence information of sequence information on the plurality of single biological units belonging to the cluster of the same lineage from a highest quality based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage); (E) a step of selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length (which can be a partial or full length) than the partial sequence information from the partial sequence information; (E″) a step of selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological unit from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and evaluating a draft created up to this point based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage); (H) a step of evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage, if evaluation of the draft does not change (i.e., remains within a certain range) due to an increase in the number in a population of a set of sequence information; (H′) a step of comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster; and (J) a step of determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage), and, if appropriate, repeating (D) to (E′) for partial sequence information of sequence information on the plurality of single biological units belonging to the reclustered cluster.
In another aspect, the present disclosure provides a data structure containing partial sequence information of sequence information on a plurality of single biological units clustered for each of the same lineages based on an organism lineage identification sequence. In one embodiment, the partial sequence information contained in the data structure is derived from two or more independently clustered and created databases. In one embodiment, information associated with the independently performed clustering is linked to and stored with the partial sequence information. In one embodiment, the partial sequence information, as a whole, constitutes genomic information. In one embodiment, the partial sequence information is collected for each single biological unit. In one embodiment, the partial sequence information is linked to and stored with identification information (ID information) on a single biological unit from which the partial sequence information is derived.
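One possible in-memory layout for such a data structure is sketched below. The class and field names (`PartialSequence`, `LineageCluster`, `unit_id`, `source_database`, `clustering_note`) are hypothetical; the sketch only illustrates linking each piece of partial sequence information to the ID of its single biological unit, to its database of origin, and to information on the independently performed clustering.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PartialSequence:
    unit_id: str           # ID information on the single biological unit of origin
    sequence: str          # the partial sequence itself
    source_database: str   # independently clustered database it came from
    clustering_note: str   # information associated with that clustering


@dataclass
class LineageCluster:
    lineage: str                                  # lineage shared by all records
    records: list = field(default_factory=list)   # PartialSequence entries

    def add(self, record: PartialSequence) -> None:
        self.records.append(record)

    def by_unit(self) -> dict:
        """Collect the partial sequences for each single biological unit."""
        grouped = defaultdict(list)
        for record in self.records:
            grouped[record.unit_id].append(record.sequence)
        return dict(grouped)

    def source_databases(self) -> list:
        """Names of the databases contributing to this cluster, sorted."""
        return sorted({record.source_database for record in self.records})
```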
In another embodiment, the present disclosure provides a data structure for a single biological unit obtained by integrating a plurality of data structures, each containing partial sequence information of sequence information on a plurality of single biological units clustered for each of the same lineages based on an organism lineage identification sequence. A high quality database integrating single biological units such as single cells was not available in the past, and is provided for the first time by the present disclosure.
(System)
In one aspect, the present disclosure provides a system for processing sequence information on a single biological unit (e.g., cell or cell structure). The system comprises: (A) a clustering unit for clustering partial sequence information of sequence information on a plurality of single biological units (e.g., collection of genomes, transcriptomes, proteomes, equivalent genes, or the like) for each of the same lineages based on an organism lineage identification sequence (e.g., 16S rDNA or a marker gene); (B) optionally, an additional information addition unit for adding, to the cluster, partial sequence information on the single biological units corresponding to the cluster in a database (this can be the same or separate from the clustering unit); and (C) a draft creation unit for creating a sequence information draft for the single biological units by using the partial sequence information of sequence information on the single biological units and sequence information on the single biological units in the database.
The additional information addition unit corresponding to B) is optional, which may or may not utilize a database.
In this manner, a clustering method materialized by the clustering unit can be a method utilizing a database (
The system of the present disclosure can newly identify an organism lineage identification sequence (marker) from a database after classification. In this aspect, the present disclosure provides a system for processing sequence information on a single biological unit (e.g., cell). The system comprises: (A) a clustering unit for clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; and (B) an identification unit (also referred to as marker identification unit) for comparing partial sequence information corresponding to the cluster in a database with partial sequence information of the cluster, calculating a degree of similarity for each partial sequence, and identifying a partial sequence with a degree of similarity greater than or equal to a predetermined degree of similarity as an organism lineage identification sequence. In such a case, an organism lineage identification sequence can be used as a so-called biomarker.
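For illustration only, the marker identification unit can be sketched as follows. The position-wise identity function is a toy stand-in for a real alignment-based similarity, and the function names and the default threshold are assumptions made for this sketch.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions over the longer length (toy similarity, no alignment)."""
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def identify_markers(cluster_seqs: Dict[str, str],
                     database_seqs: Dict[str, str],
                     min_similarity: float = 0.95) -> List[str]:
    """Return the names of partial sequences whose similarity to the database
    entry of the same name is greater than or equal to the threshold."""
    markers = []
    for name, seq in cluster_seqs.items():
        ref = database_seqs.get(name)
        if ref is not None and identity(seq, ref) >= min_similarity:
            markers.append(name)
    return markers
```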
In one aspect, the present disclosure provides a system for processing sequence information on a single biological unit. The system comprises: (D) a ranking unit for ranking partial sequence information of sequence information on a plurality of single biological units from a highest quality based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage); and (E) a draft construction unit for selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length (which can be a partial or full length) than the partial sequence information from the partial sequence information, selecting a population of a set of a different number of pieces of partial sequence information of sequence information on a single biological unit from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and selecting a draft created up to this point based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage). It is preferable to repeat draft creation a plurality of times while changing the number of pieces of partial sequence information (e.g., SAGs).
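For illustration only, the ranking and draft construction described above can be sketched as follows. The quality score (completion percentage minus five times contamination percentage) and the `assemble` callable, which stands in for an actual co-assembly and evaluation step, are assumptions made for this sketch.

```python
from typing import Callable, List, Sequence, Tuple

Sag = Tuple[str, float, float]  # (sag_id, completion %, contamination %)


def quality(completeness: float, contamination: float) -> float:
    """Hypothetical score: reward completion, penalize contamination."""
    return completeness - 5.0 * contamination


def best_draft(sags: Sequence[Sag],
               assemble: Callable[[List[str]], Tuple[float, float]],
               sizes: Sequence[int]) -> Tuple[int, float]:
    """Rank SAGs, build a draft from the top n for each n in sizes,
    and return (n, score) of the best draft created up to that point."""
    ranked = sorted(sags, key=lambda s: quality(s[1], s[2]), reverse=True)
    best_n, best_score = 0, float("-inf")
    for n in sizes:
        ids = [s[0] for s in ranked[:n]]
        completeness, contamination = assemble(ids)  # hypothetical assembler + evaluator
        score = quality(completeness, contamination)
        if score > best_score:
            best_n, best_score = n, score
    return best_n, best_score
```

Passing the assembler as a callable keeps the selection logic independent of any particular assembly or evaluation tool.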
In a preferred embodiment, the system of the present disclosure comprises: (F) a selection unit for comparing the selected draft with partial sequence information of sequence information on a single biological unit that was not selected in (E) or (E′) and selecting partial sequence information of sequence information on the single biological unit having a sequence of a portion that is not included in the draft (this can be configured as a part of the draft construction unit); (G) a draft improvement unit for creating a longer draft by using the sequence information selected in (F) and the selected draft (this can also be configured as a part of the draft construction unit); (G′) optionally, a draft construction unit for repeating (G) preferably until the longer draft reaches a full length of sequence information; and (G″) optionally, means for repeating the ranking, draft construction, and selection in (D), (E), and (E′) based on a judgment criterion with a lower criterion in the entire partial sequence information constituting the draft. The repeat can be materialized in the draft construction unit or the like. For example, a looser parameter can be used as the judgment criterion with a lower criterion.
In another aspect, the system of the present disclosure encodes a method of refining a cluster in an aspect related to the stage immediately after determining that SAG is of the “same” cluster (e.g., lineage or species). In this aspect, the present disclosure provides a system for processing sequence information on a single biological unit. The system comprises: (A) a clustering unit for clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; (H) a reclustering unit for evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and reclustering within the cluster of the same lineage (this can be materialized in the clustering unit); (H′) a comparison unit for comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster (this can also be materialized in the clustering unit); and (I) a registration unit for determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage), and, if appropriate, registering the draft in a database as a new group.
In this regard, the evaluation described above can evaluate extracted partial sequence information (e.g., SAGs) with a marker gene in a round robin format, and the evaluation can use, for example, the distance between each SAG.
In a preferred embodiment, reclustering in the present disclosure is performed through network analysis and community detection.
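For illustration only, the round-robin evaluation and reclustering can be sketched as follows. Connected components of a distance-thresholded graph are used here as a deliberately simple stand-in for network analysis and community detection; the function name, the distance callable, and the threshold are assumptions made for this sketch.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence


def recluster(sag_ids: Sequence[str],
              distance: Callable[[str, str], float],
              max_distance: float) -> List[List[str]]:
    """Link SAGs whose pairwise distance is <= max_distance and
    return the connected components as the new clusters."""
    parent: Dict[str, str] = {s: s for s in sag_ids}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Round-robin (all-against-all) comparison of the SAGs in the cluster.
    for a, b in combinations(sag_ids, 2):
        if distance(a, b) <= max_distance:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for s in sag_ids:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())
```

A modularity-based community detection method could replace the connected-components step without changing the interface.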
The system of the present disclosure also provides processing in an aspect of the stage after draft quality no longer improves even after increasing the number of partial sequence information (e.g., SAGs). In this aspect, the present disclosure provides a system for processing sequence information on a single biological unit. The system comprises: (A) a clustering unit for clustering partial sequence information of sequence information on a plurality of single biological units for each of the same lineages based on an organism lineage identification sequence; (D) a ranking unit for ranking partial sequence information of sequence information on the plurality of single biological units belonging to the cluster of the same lineage from a highest quality based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage); (E) a draft construction unit for selecting a population of a predetermined number of pieces of partial sequence information of sequence information on the plurality of single biological units from a highest ranking based on the ranking to construct a draft with a greater length (which can be a partial or full length) than the partial sequence information from the partial sequence information, selecting a population of a set of a different number of pieces of partial sequence information of sequence information on the single biological unit from the population, constructing a draft with a greater length than the partial sequence information from the partial sequence information, and evaluating a draft created up to this point based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage); (H) a reclustering unit for evaluating partial sequence information of sequence information on the plurality of single biological units constituting sequence information on a single biological unit based on an organism lineage identification sequence within the cluster of the same lineage, and 
reclustering within the cluster of the same lineage, if evaluation of the draft does not change (i.e., remains within a certain range) due to an increase in the number in a population of a set of sequence information (this can be materialized in the clustering unit); (H′) a comparison unit for comparing a sequence information draft created from the cluster of the same lineage with a sequence information draft created from the reclustered cluster (this can also be materialized in the clustering unit); and (J) a determination unit for determining whether reclustering in (H) is appropriate for a result of comparison based on a predetermined judgment criterion (e.g., completion percentage or contamination percentage), and, if appropriate, the determination unit repeats the steps of (D) to (E′) for partial sequence information of sequence information on the plurality of single biological units belonging to the reclustered cluster.
The system, program, recording medium, and method according to one or more embodiments of the present disclosure have been described based on the embodiments, but the present disclosure is not limited to such embodiments. Various modifications applied to the present embodiments and embodiments constructed by combining constituent elements in different embodiments that are conceived by those skilled in the art are also encompassed within the scope of one or more embodiments of the present disclosure, as long as such embodiments do not deviate from the intent of the present disclosure.
Some or all of the constituent elements of the present disclosure in each of the embodiments described above can be comprised of a single system LSI (Large Scale Integration). For example, the system for processing sequence information of the present disclosure can be optionally combined with a database, or can be equipped with or combined with a system for identifying a sequence with a function such as a biomarker (
A system LSI is an ultra-multifunctional LSI manufactured by integrating a plurality of constituents on a single chip; specifically, it is a computer system comprising a microprocessor, ROM (Read Only Memory), RAM (Random Access Memory), and the like. A computer program is stored in the ROM. The system LSI accomplishes its function by the microprocessor operating in accordance with the computer program. The term system LSI is used herein, but the terms IC, LSI, super LSI, and ultra LSI can also be used depending on the degree of integration. The method for forming an integrated circuit is not limited to LSI. An integrated circuit can be materialized with a dedicated circuit or a general-purpose processor. After the manufacture of an LSI, a programmable FPGA (Field Programmable Gate Array) or a reconfigurable processor that allows reconfiguration of the connections or settings of circuit cells inside the LSI can be utilized. If an integrated circuit technology that replaces LSI becomes available through advances in semiconductor technology or other derivative technologies, functional blocks can of course be integrated using such a technology. Application of biotechnology or the like is a possibility.
One aspect of the present disclosure can be not only such a sequence information processing device or system, but also a functionally specialized system (e.g., biomarker screening device, efficacy determination device, diagnostic device, etc.). Further, one embodiment of the present disclosure can be a computer program causing a computer to execute each characteristic step in sequence information processing. One embodiment of the present disclosure can also be a computer-readable non-transitory recording medium on which such a computer program is recorded.
In each of the embodiments described above, each constituent element can be materialized by dedicated hardware or by executing a software program suited to the constituent element. Each constituent element can be materialized by a program execution unit such as a CPU or a processor reading out and executing a software program recorded on a recording medium such as a hard disk or semiconductor memory. In this regard, software materializing the present disclosure of each of the embodiments described above or the like can be a program such as those described above herein.
The sequence information processing technology of the present disclosure can be provided in a form comprising all constituents as a single system or device. Alternatively, the technology can also be envisioned in a form of mainly displaying analysis and results as a sequence information processing device while calculation or differentiation model calculation is performed on a server or cloud. Some or all of them can be performed using IoT (Internet of Things) and/or artificial intelligence (AI) (
Alternatively, a sequence information processing device can also be envisioned in a semi-standalone form where means required for various calculations is stored and performs an analysis therein, but the calculations required for the analysis are performed on a server or cloud. Since transmission/reception is not always possible at some locations such as hospitals, this is a model envisioned for use when communication is blocked.
A storage unit can be a recording medium such as a CD-R, DVD, Blu-ray disc, USB memory, SSD, or hard disk. A storage unit can also be located on a server, or data can be recorded on the cloud as appropriate.
“Software as a service (SaaS)” mostly falls under such a cloud service. Since a sequence information processing device is understood to be installed with a differentiation algorithm made from data produced in a laboratory environment, the device can be provided as a system comprising two or three features of these embodiments.
Data can also be stored as needed. Data storage is generally provided on the server side, but data can optionally be stored on the terminal side, not only for fully equipped models but also for cloud models. When a service is provided on the cloud, options such as standard (e.g., up to 10 GB on the cloud), option 1 (e.g., additional 10 TB on the cloud), option 2 (parameter is set for divided storage on the cloud), and option 3 (analysis optionally stored on the cloud) can be provided for data storage. Data is stored and imported from all sold devices to create big data (e.g., a sequence database), and an analysis model is continuously updated or a new model is constructed so that new differentiation model software such as a “disease determination model” can be provided.
There can also be data analysis options. In this regard, analysis tailored to a request from a user of the service provider or the like can be provided. In other words, this can be envisioned as an option for a calculation method.
As used herein, “or” is used when “at least one or more” of the listed matters in the sentence can be employed. When explicitly described herein as “within the range of two values”, the range also includes the two values themselves.
Reference literature such as scientific literature, patents, and patent applications cited herein is incorporated herein by reference to the same extent that the entirety of each document is specifically described.
As described above, the present disclosure has been described while showing preferred embodiments to facilitate understanding. While the present disclosure is described hereinafter based on Examples, the above descriptions and the following Examples are not provided to limit the present disclosure, but for the sole purpose of exemplification. Thus, the scope of the present disclosure is not limited to the embodiments or the Examples specifically described herein and is limited only by the scope of claims.
The Examples are described hereinafter.
For reagents, the specific products described in the Examples were used. However, an equivalent product from another manufacturer can also be used instead.
12 SAG data each for E. coli K12 (ATCC 10798) and B. subtilis (ATCC 6633) were obtained from Hosokawa et al. In the paper of Hosokawa et al., these cells were acquired from the ATCC. E. coli K12 was cultured in Luria-Bertani (LB) medium (1.0% Bacto-tryptone, 0.5% yeast extract, 1.0% NaCl, pH 7.0). B. subtilis was cultured in Brain Heart Infusion Broth (ATCC medium 44, Thermo Fisher Scientific, San Jose, Calif., USA). The collected cells were washed three times with UV-treated Phosphate-Buffered Saline (−) (PBS, Thermo Fisher Scientific) and subjected to single-droplet MDA and sequencing.
(Preparation of Mouse Gut Microbiota)
Feces was collected from a male 7-week-old ICR mouse (Tokyo Laboratory Animals Science Co., Ltd., Tokyo, Japan) and homogenized in PBS. The supernatant was recovered by centrifugation at 2000×g for 2 seconds, and centrifuged at 15000×g for 3 minutes. The resulting cell pellet was washed twice with PBS, and finally resuspended in PBS.
A microfluidic droplet generator and an MDA reaction device were fabricated and used for single-droplet MDA according to the report of Hosokawa et al. Prior to analysis, cell suspensions were adjusted to a concentration of 0.1 cells/droplet to prevent encapsulation of multiple cells in a single droplet. Using the droplet generator, single microbial cells were encapsulated in lysis buffer D2 (QIAGEN, Hilden, Germany), and lysed at 65° C. for 10 minutes. Cell lysates were then injected into a droplet fusion device and mixed with droplets of MDA reaction mix (REPLI-g Single Cell Kit, QIAGEN) supplemented with Tween-20 and EvaGreen. After collection in PCR tubes, the droplets were incubated at 30° C. for 2 hours and at 65° C. for 3 minutes. For single-cell sequencing, droplets that became fluorescent were individually picked and transferred by micropipette under an open clean bench (KOACH 500-F, KOKEN LTD., Tokyo, Japan) into fresh MDA reaction mix. After 2 hours of incubation at 30° C., the enzyme was inactivated at 65° C. for 3 minutes.
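For illustration only, the effect of loading at 0.1 cells/droplet can be estimated as follows, assuming that the number of cells per droplet follows a Poisson distribution. This distributional assumption and the function names are introduced for this sketch and are not stated in the present disclosure.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k cells in a droplet at a mean loading of lam cells/droplet."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def multi_cell_fraction(lam: float) -> float:
    """Among droplets holding at least one cell, the fraction holding two or more."""
    p0 = poisson_pmf(0, lam)
    p1 = poisson_pmf(1, lam)
    return (1.0 - p0 - p1) / (1.0 - p0)


if __name__ == "__main__":
    print(f"{multi_cell_fraction(0.1):.3f}")
```

Under this assumption, about 5% of the occupied droplets at 0.1 cells/droplet would contain more than one cell, while roughly 90% of all droplets would be empty.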
(16S rDNA Sequencing)
To confirm amplification from single cells, 16S rRNA gene fragments V3-V4 were amplified and sequenced by Sanger sequencing from SAGs obtained by single-droplet MDA. To compare the phylogenetic distribution, 16S rRNA fragments (V3-V4) were also amplified from a metagenomic sample of gut microbiota and sequenced by MiSeq (Illumina, San Diego, Calif., USA). Paired-end reads were connected, trimmed, and clustered by UPARSE into taxonomic units at 97% identity. Taxonomy was determined in RDP classifier.
Illumina libraries for single-cell sequencing were prepared from products of single-droplet MDA using Nextera XT DNA sample prep kit (Illumina) with Nextera XT Index Kit. Libraries were then sequenced on an Illumina MiSeq system at 2×300 paired-end reads.
(Quality Control of SAG Reads and Construction of Cross-Reference Contigs (Step 1 in ccSAG))
SAGs were first grouped based on 16S rRNA similarity ≥99% and ANI ≥95%. Nucleotide identity was estimated by pairwise BLAST between full-length raw SAG contigs, and was calculated over ≥500 bp. Grouped SAG reads were then pre-filtered using FASTX-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) and PRINSEQ to remove low-quality reads (≥50% of bases with quality scores <25), trim the 3′-end of reads with low-quality bases (quality score <20), remove short reads (<20 bp) and reads with ≥1% of bases unidentified, and discard unpaired reads after such prefiltration. Subsequently, contigs were individually assembled de novo from raw SAG reads using SPAdes-3.9.0 with options --careful --disable-rr --sc. Finally, raw SAG contigs ≥500 bp were collected for cross-reference mapping.
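For illustration only, the read pre-filtering criteria can be sketched for a single read as follows. This is not the FASTX-toolkit or PRINSEQ implementation; the order of the filters, the treatment of the unidentified-base cutoff as "1% or more", and the function name are assumptions made for this sketch, and the removal of unpaired reads is omitted.

```python
from typing import List, Optional, Tuple

Read = Tuple[str, List[int]]  # (bases, per-base Phred quality scores)


def prefilter(read: Read) -> Optional[Read]:
    """Apply the described filters to one read; return None if the read is discarded."""
    bases, quals = read
    if not bases:
        return None
    # Remove low-quality reads: >= 50% of bases with a quality score < 25.
    low = sum(1 for q in quals if q < 25)
    if low / len(quals) >= 0.5:
        return None
    # Trim low-quality bases (quality score < 20) from the 3' end.
    end = len(quals)
    while end > 0 and quals[end - 1] < 20:
        end -= 1
    bases, quals = bases[:end], quals[:end]
    # Remove short reads (< 20 bp).
    if len(bases) < 20:
        return None
    # Remove reads with >= 1% unidentified bases (N).
    if bases.upper().count("N") / len(bases) >= 0.01:
        return None
    return bases, quals
```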
(Removal of Chimeric Reads by Cross Reference Mapping (Step 2 of ccSAG))
Quality-controlled reads from one SAG were mapped by BWA to multiple raw contigs constructed from other SAGs in the same group. A read was considered clean if complete alignment to reference contigs was equally or more frequent than partial alignment (soft clipping), but considered potentially chimeric if partial alignment was more frequent than complete alignment. Potential chimeras were then split into aligned and unaligned fragments, which were then remapped to multiple raw contigs and reclassified as described. Finally, fully unaligned reads and fragmented chimeras shorter than 20 bp were discarded as unmapped. Cycles of cross-reference mapping and chimera splitting were repeated until partially aligned, potentially chimeric reads were undetectable.
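For illustration only, the classification and splitting of reads can be sketched as follows, assuming that each alignment of a read is summarized by a SAM-style CIGAR string in which soft clipping appears as an S operation. The function names are hypothetical, and the mapping itself (e.g., by BWA) is outside this sketch.

```python
import re
from typing import List

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_partial(cigar: str) -> bool:
    """An alignment is partial if its CIGAR string contains soft clipping (S)."""
    return any(op == "S" for _, op in CIGAR_OP.findall(cigar))


def classify(cigars: List[str]) -> str:
    """Classify one read from its alignments to raw contigs of other SAGs in the group."""
    if not cigars:
        return "unmapped"
    partial = sum(1 for c in cigars if is_partial(c))
    complete = len(cigars) - partial
    return "clean" if complete >= partial else "chimeric"


def split_chimera(seq: str, cigar: str, min_len: int = 20) -> List[str]:
    """Split a read into soft-clipped and aligned fragments, dropping fragments < min_len."""
    fragments: List[str] = []
    pos = 0
    aligned_start = None
    for n_str, op in CIGAR_OP.findall(cigar):
        n = int(n_str)
        if op == "S":
            if aligned_start is not None:
                fragments.append(seq[aligned_start:pos])
                aligned_start = None
            fragments.append(seq[pos:pos + n])
            pos += n
        elif op in "MI=X":  # operations that consume read bases
            if aligned_start is None:
                aligned_start = pos
            pos += n
        # D, N, H, and P do not consume read bases
    if aligned_start is not None:
        fragments.append(seq[aligned_start:pos])
    return [f for f in fragments if len(f) >= min_len]
```

The fragments returned by `split_chimera` would then be remapped and reclassified, and the cycle repeated until no read is classified as chimeric.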
(Co-Assembly of Clean SAGs and Contig Extension (Step 3 in ccSAG))
Clean reads from each SAG were co-assembled de novo using SPAdes into clean composite SAG contigs. Similarly, raw SAG reads were co-assembled de novo into raw composite SAG contigs. Gaps between clean composite contigs were filled by BLAST mapping against raw composite contigs. Briefly, potentially usable raw composite contigs were identified by ≥99% identity to clean composite contigs. Such raw composite contigs were then collected into a database, against which clean composite contigs were mapped by BLAST and gap-filled based on the resulting alignments, thereby generating bridged composite SAG contigs, which essentially comprise the composite single-cell genome.
Assembly quality was evaluated by QUAST (Gurevich A et al., Bioinformatics. 2013 Apr. 15; 29(8):1072-5.). For the analysis of cell lines, all sequence data were mapped to the NCBI reference genome of NC_000913 (E. coli substrain MG1655) with F plasmid and lambda phage sequences or the NCBI reference genome of NC_014479 (Bacillus subtilis subsp. spizizenii str. W23). For the analysis of uncultured cell genomes obtained by this Example, bridged composite SAG contigs were used as references to identify potential misassemblies and determine the genome fraction of each SAG. Completeness and contamination were evaluated by CheckM (Parks D H et al., Genome Res. 2015 July; 25(7): 1043-55.). Taxonomy was assigned in AMPHORA2 or by BLAST search of 16S rDNA sequences in RNAmmer (Lagesen K et al., Nucleic Acids Res. 2007; 35(9):3100-8.). Gene pathway analysis was performed in KAAS (Moriya Y et al., Nucleic Acids Res. 2007 July; 35 (Web Server issue): W182-5.) and MAPLE (Takami H et al., DNA Res. 2016 Jul. 3. pii: dsw030.), while assembly graphs were generated in Bandage (Wick R R et al., Bioinformatics. 2015 Oct. 15; 31(20): 3350-2.). For the analysis of SNPs, each single-cell-amplified genome was mapped onto the coding sequences of the bridged composite SAG contigs, and then the nucleotides were screened for sites with a coverage depth of at least 5 reads where ≥99.9% of reads did not match the reference and showed homogeneous bases (nucleic acid sequence). After that, nucleotide sites that contained both multiple matched SAGs and unmatched SAGs in the same strain were identified as SNPs.
Once single biological unit genome analysis is completed, the provisional phylogenetic classification in the draft genomic information table of a microorganism genome database is referenced to extract the corresponding draft genomic information and genetic information. The marker type of the genetic information is referenced to obtain an organism lineage identification sequence. A gene of the same protein family as the protein family of the organism lineage identification sequence is extracted from the genetic information in the single biological unit genomic data. If there is no corresponding genetic information, the processing ends and transitions to the next processing. If there is corresponding genetic information, a homology search is performed with a homology analysis tool such as BLAST between the gene base sequences in the single biological unit genomic data and the organism lineage identification sequences in a round robin format. Since only pairs with homology above a certain threshold value are targeted, pairs at or below the threshold value (e.g., homology of 70% or less) are excluded. For each organism lineage identification sequence, the gene base sequence in the single biological unit genomic data with the highest homology is detected. The weighted average of homology and matched base sequence length is found as the degree of similarity (distance) between two genomes. If a plurality of draft genomes with the same degree of similarity are detected, homology is searched between assembled base sequences, instead of with an organism lineage identification sequence, in a round robin format. The degree of similarity is calculated by performing the same processing as with an organism lineage identification sequence. The draft genome with the highest degree of similarity is used as the baseline for clustering.
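For illustration only, the similarity calculation described above can be sketched as follows. The weighted average is interpreted here as the mean homology weighted by matched base sequence length, which is an assumption; the tuple layout and function names are hypothetical, and the tie-breaking search between assembled base sequences is omitted.

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, float, int]  # (marker_id, gene_id, percent homology, matched length)


def genome_similarity(hits: Iterable[Hit], min_identity: float = 70.0) -> float:
    """Length-weighted mean homology over the best hit of each marker; 0.0 if no hit passes."""
    best: Dict[str, Tuple[float, int]] = {}
    for marker, _gene, identity, length in hits:
        if identity <= min_identity:  # pairs at or below the threshold are excluded
            continue
        if marker not in best or identity > best[marker][0]:
            best[marker] = (identity, length)
    total_len = sum(length for _, length in best.values())
    if total_len == 0:
        return 0.0
    return sum(identity * length for identity, length in best.values()) / total_len


def closest_draft(hits_by_draft: Dict[str, Iterable[Hit]]) -> str:
    """Pick the draft genome with the highest similarity as the baseline for clustering."""
    return max(hits_by_draft, key=lambda d: genome_similarity(hits_by_draft[d]))
```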
It is also understood that the method proposed in D. H. Parks et al., 2015 can be applied as a method of creating an organism lineage identification sequence that is different from the method described above. This is a method of creating a phylogenetic tree of the draft genome and defining an organism lineage identification sequence for each node, which is used as input data for CheckM.
A higher quality genome can be constructed as shown in
(Amplification) Bias homogenization is performed to improve the quality of a genome sequence obtained by assembly of sequence data including a bias. Specifically, a certain amount of sequence reads of a sequence site found to have a large number of duplications is removed based on results of mapping sequence reads to a reference genome sequence to correct a bias in the sequence reads for homogenization (
For the reference genome sequence, the genome of a known relative organism species or a DNA sequence created by assembly of sequence data itself on which bias homogenization is performed can be used. The resulting draft genome complement ratio or sequence fragment count is improved by assembly of sequence data that has been homogenized. Depending on the situation, further improvement in genome quality is expected by repeated homogenization using a genome sequence created from homogenized sequence data as a reference sequence. Specifically, the following was performed.
A genome was assembled using nanopore sequence data (GridION) on E. coli K12 strain single cell amplified genome (SAG). Sequence data with significantly different read depth for each genome region (
E. coli genome
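For illustration only, the bias homogenization described above can be sketched as a depth-capped downsampling of mapped reads. The capping rule (a read is kept only while some position it covers is still below a maximum depth), the random processing order, and the function name are assumptions made for this sketch and do not represent the exact procedure used in the Example.

```python
import random
from typing import List, Tuple

Read = Tuple[str, int, int]  # (read_id, start, end) on the reference, end exclusive


def homogenize(reads: List[Read], genome_len: int, max_depth: int, seed: int = 0) -> List[Read]:
    """Process reads in random order and drop any read whose whole span
    has already reached max_depth on the reference."""
    rng = random.Random(seed)
    order = reads[:]
    rng.shuffle(order)
    depth = [0] * genome_len
    kept: List[Read] = []
    for read_id, start, end in order:
        start, end = max(0, start), min(genome_len, end)
        if start >= end:
            continue
        # Keep the read if it adds coverage somewhere still below the cap.
        if min(depth[start:end]) < max_depth:
            for i in range(start, end):
                depth[i] += 1
            kept.append((read_id, start, end))
    return kept
```

The reads kept by such a step would then be assembled, and the resulting assembly could in turn serve as the reference sequence for a further round of homogenization.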
As described above, the present disclosure is exemplified by the use of its preferred embodiments. However, it is understood that the scope of the present disclosure should be interpreted solely based on the Claims. It is also understood that any patent, any patent application, and any references cited herein should be incorporated herein by reference in the same manner as the contents are specifically described herein. The present application claims priority to Japanese Patent Application No. 2019-85839 filed on Apr. 26, 2019 with the Japan Patent Office. It is understood that the entire content thereof is incorporated herein by reference in the same manner as if the contents are specifically described herein.
Automation of processing of single cell data of microorganisms and the like is enabled.
Number | Date | Country | Kind |
---|---|---|---|
2019-085839 | Apr 2019 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2020/017795 | 4/24/2020 | WO |