This application claims the priority benefit of Taiwan application serial no. 111107822, filed on Mar. 3, 2022. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
The present disclosure relates to a method for constructing a database, and more particularly, to a method for constructing a ribosomal RNA database.
In recent years, the rapid development of high-throughput gene sequencing technology has expedited research on microbial organisms, and the amount of microbial sequence data has expanded considerably. Specifically, ribosomal RNA genes are often used as genetic markers of microorganisms for species classification, from which the regulatory functions played by bacteria in the human body may be inferred. Among the ribosomal RNA genes, the 16S rRNA of prokaryotes (including archaea and bacteria) and the 18S rRNA of eukaryotes, both small subunit rRNAs (SSU rRNA), are the most important genetic markers. In addition, when the 23S/28S large subunit rRNA (LSU rRNA) is analyzed together with the adjacent SSU rRNA, more species classification information may be obtained.
In a large microbial database, the correctness and integrity of the data might directly or indirectly affect the subsequent analysis and prediction results of various microbial phases. At present, the main sequence databases may be divided into two categories: native repository databases and value-added databases. The native repository database is mainly the International Nucleotide Sequence Database Collaboration (INSDC), whose members include NCBI, EMBL, and DDBJ. Its records are mainly uploaded by researchers, who provide the sequences and the related species classification information. This type of database has the largest number of sequences, but it contains a lot of data noise and too much invalid information. Value-added databases, such as SILVA, EzBioCloud, and Greengenes, mainly take sequences from the INSDC databases and then exclude redundant sequences and aggregate highly similar sequences. Unknown sequences are subjected to sequence comparison or evolutionary tree analysis so that a name or a specific number may be assigned to the species. This approach further reduces the amount of data relative to the native repository database. However, because unknown sequences are processed inconsistently, the classification information of the sequences might be erroneous.
Both types of databases mentioned above lack normalization and homogenization of classification information. In subsequent microbiological analysis, the prediction results are often affected by misplaced classification information or by minor discrepancies in characters. Therefore, developing a method for constructing a ribosomal RNA database that increases the accuracy of the data and improves the prediction accuracy is an important issue for current research.
The present disclosure provides a method for constructing a ribosomal RNA database, which may increase the accuracy of the data to improve the prediction accuracy, and may be applied to various subsequent analysis methods to maintain the consistency and accuracy of results.
In the disclosure, a construction method of a ribosomal RNA database includes the following steps: selecting a source of a nucleic acid sequence database; performing normalization and homogenization on species classification rules; using AI technology for normalized classification and naming; selecting the kingdom to which the sequence species belongs; filtering out redundant sequences and sequences with inconsistent lengths; setting a threshold for unknown bases other than A, T, C or G, and excluding sequences whose unknown bases exceed the threshold; and excluding sequences with insufficient classification information.
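The unknown-base step listed above can be illustrated with a minimal Python sketch. The disclosure does not state the threshold value or whether it is a count or a proportion, so the fractional threshold `max_fraction=0.01` and the function names below are assumptions made only for illustration.

```python
# Minimal sketch of the unknown-base filter; names and the threshold
# value are hypothetical and not taken from the disclosure.

KNOWN_BASES = frozenset("ATCG")


def unknown_base_fraction(sequence: str) -> float:
    """Return the fraction of bases that are not A, T, C or G."""
    seq = sequence.upper()
    if not seq:
        # Treat an empty sequence as entirely unknown so that it is excluded.
        return 1.0
    unknown = sum(1 for base in seq if base not in KNOWN_BASES)
    return unknown / len(seq)


def passes_unknown_base_threshold(sequence: str, max_fraction: float = 0.01) -> bool:
    """Keep a sequence only if its unknown-base fraction is within the threshold."""
    return unknown_base_fraction(sequence) <= max_fraction
```

A count-based threshold would work the same way, with the comparison made on the number of unknown bases rather than on their proportion.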
In an embodiment of the present disclosure, the nucleic acid sequence database includes a native repository database or a value-added database.
In an embodiment of the present disclosure, the ribosomal RNA database includes a 16S rRNA gene database.
In an embodiment of the present disclosure, a seventh-order nomenclature is used for normalization to form a hierarchy relation table. The hierarchies defined in the seventh-order nomenclature include kingdom, phylum, class, order, family, genus, and species.
In an embodiment of the present disclosure, the method for homogenization includes looking up the information of the other hierarchies in the classification hierarchy relation table based on the species names in the nucleic acid sequence database, or, based on the serial numbers of species in the nucleic acid sequence database, using the serial number as a search target for comparison with a database that stores serial numbers. After the species name corresponding to the serial number is found, the information of the other hierarchies may be found from the classification hierarchy relation table.
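The two lookup routes described above (by species name, or by serial number and then by species name) may be sketched as follows. The table contents, the serial-number store, and the function name `homogenize` are hypothetical stand-ins chosen for illustration; the disclosure does not specify how the hierarchy relation table or the serial-number database is stored.

```python
from typing import Dict, Optional

# The seven hierarchies named in the disclosure.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

# Hypothetical hierarchy relation table keyed by species name (illustrative values).
HIERARCHY_TABLE: Dict[str, Dict[str, str]] = {
    "Escherichia coli": {
        "kingdom": "Bacteria",
        "phylum": "Proteobacteria",
        "class": "Gammaproteobacteria",
        "order": "Enterobacterales",
        "family": "Enterobacteriaceae",
        "genus": "Escherichia",
        "species": "Escherichia coli",
    },
}

# Hypothetical store mapping a species serial number to a species name.
SERIAL_TO_SPECIES: Dict[str, str] = {"562": "Escherichia coli"}


def homogenize(species_name: Optional[str] = None,
               serial_number: Optional[str] = None) -> Optional[Dict[str, str]]:
    """Return the full seven-level record for a sequence, or None if it cannot be resolved."""
    if species_name is None and serial_number is not None:
        # Route 2: resolve the serial number to a species name first.
        species_name = SERIAL_TO_SPECIES.get(serial_number)
    if species_name is None:
        return None
    # Route 1: look up the other hierarchies by species name.
    return HIERARCHY_TABLE.get(species_name)
```

Returning `None` for an unresolved name or serial number leaves the decision about such sequences to a later step, such as the exclusion of sequences with insufficient classification information.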
In an embodiment of the present disclosure, the step of using AI technology to perform normalized classification and naming includes performing comparison according to the species hierarchy, so as to confirm that there is no repetition in the sequence classification information.
In an embodiment of the present disclosure, for the 16S rRNA gene database, the step of selecting the kingdom to which the sequence species belongs includes selecting sequences belonging to the kingdom of Archaea and the kingdom of Bacteria, and excluding sequences of other kingdoms as well as sequences whose kingdom is mistakenly named as Archaea or Bacteria.
In an embodiment of the present disclosure, in the 16S rRNA gene database, when a sequence is 100% identical to another sequence of the same species, the sequence is a redundant sequence.
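One possible way to express the kingdom selection is sketched below in Python. The disclosure does not say how a mistakenly named kingdom is detected; this sketch assumes, purely for illustration, that the labelled kingdom is cross-checked against a reference kingdom looked up by species name. The record fields, the reference table, and the function name are hypothetical.

```python
from typing import Dict

ALLOWED_KINGDOMS = {"Archaea", "Bacteria"}

# Hypothetical reference mapping from species name to its kingdom.
REFERENCE_KINGDOM: Dict[str, str] = {
    "Escherichia coli": "Bacteria",
    "Methanobrevibacter smithii": "Archaea",
    "Saccharomyces cerevisiae": "Fungi",
}


def keep_for_16s(record: Dict[str, str]) -> bool:
    """Keep a record only if it is labelled Archaea or Bacteria and the label is confirmed."""
    labelled = record.get("kingdom")
    if labelled not in ALLOWED_KINGDOMS:
        return False  # belongs to another kingdom
    expected = REFERENCE_KINGDOM.get(record.get("species", ""))
    # Assumption: a label that the reference does not confirm is treated as mistaken.
    return expected == labelled
```

Under this assumption a species missing from the reference is also dropped, which is a conservative choice rather than something the disclosure requires.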
In an embodiment of the present disclosure, in the 16S rRNA gene database, the sequences with inconsistent lengths are those that are shorter than 1200 bases or longer than 1800 bases in length.
Based on the above, the construction method of a ribosomal RNA database of the present disclosure includes retrieving high-quality sequence data from the value-added database, and normalizing and homogenizing the classification information. In this way, not only may highly representative sequences be effectively selected, but the amount of data may also be reduced while the coverage of species at all hierarchies of classification is increased. The database constructed through this process may be applied to various subsequent analysis methods to maintain the consistency and accuracy of results.
As used herein, a range defined by “one value to another value” is a general description that avoids listing all the values in a range in the specification. Therefore, the recitation of a particular numerical range includes any numerical value within the numerical range and a smaller numerical range defined by any numerical value within the numerical range, and such recitation is equivalent to explicitly describing said any numerical value and said smaller numerical value in the specification.
The following examples will be described in detail in conjunction with the accompanying drawings, but the provided examples are not intended to limit the scope of the present disclosure.
The present disclosure provides a method for constructing a ribosomal RNA database.
Please refer to
Next, please continue to refer to
Next, please further refer to
Then, please continue to refer to
Thereafter, please continue to refer to
Redundant sequences and sequences with inconsistent lengths are filtered out. In terms of filtering out redundant sequences, a bacterial strain might contain one or more sets of 16S rRNAs with the same sequence. Due to the high degree of conservation of 16S rRNAs, different subtypes of the same species might have exactly the same sequence. When a sequence is 100% identical to another sequence of the same species, the sequence is regarded as a redundant sequence and should be filtered out. In terms of sequences with inconsistent lengths, the full length of 16S rRNA is about 1600 bases. Studies show that it is necessary to use sequences covering the 9 variable regions in order to accurately identify bacterial strains at the hierarchy of species. If the sequence is too short, the sequence range available for identification is insufficient, which might lead to misclassification of species. If the sequence is too long, the sequence may contain two or more sets of 16S rRNA, and other genes might be mixed between the 16S rRNAs, which will also affect the accuracy of species classification. The exclusion conditions for sequence length are, for example, defined as sequences shorter than 1200 bases or longer than 1800 bases.
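The redundancy and length filters described above may be sketched in Python as follows. The length bounds of 1200 and 1800 bases come from the example in the text; the record layout (`species` and `sequence` fields) and the function name are assumptions for illustration. Redundancy is read here as exact identity between sequences of the same species, so identical sequences from different species are both kept.

```python
from typing import Dict, Iterable, List, Set, Tuple


def filter_redundant_and_length(records: Iterable[Dict[str, str]],
                                min_len: int = 1200,
                                max_len: int = 1800) -> List[Dict[str, str]]:
    """Drop sequences outside [min_len, max_len] and exact duplicates within a species."""
    seen: Set[Tuple[str, str]] = set()
    kept: List[Dict[str, str]] = []
    for record in records:
        sequence = record["sequence"].upper()
        # Length filter: exclude sequences shorter than min_len or longer than max_len.
        if len(sequence) < min_len or len(sequence) > max_len:
            continue
        # Redundancy filter: 100% identical sequence already kept for the same species.
        key = (record["species"], sequence)
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept
```

The first occurrence of each species and sequence pair is retained, so the result depends on input order only in which duplicate record survives, not in which sequences are represented.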
Next, please continue to refer to
Finally, please continue to refer to
To sum up, the present disclosure provides a method for constructing a ribosomal RNA database, including multiple filtering processes and ensuring the integrity and interpretability of the sequence species classification hierarchy. It is expected to increase the accuracy for processing ribosomal RNA sequence data analysis, so as to improve the prediction accuracy of microbial phase. By using the construction method of a ribosomal RNA database of the present disclosure, a high-quality and high-accuracy ribosomal RNA database may be established, and the ribosomal RNA database may be used for cross-comparison with the data adopting the standard classification nomenclature, and the method of the disclosure may be directly applied to the analytical process of microbial phase.
More specifically, because the ribosomal RNA database is normalized and homogenized, the construction method of a ribosomal RNA database of the present disclosure may ensure that the most important sequence names are unlikely to be misspelt or mistaken, while the database remains comparable across databases. After the database is filtered under multiple conditions, the amount of data is considerably reduced, which helps to reduce the calculation time and makes the database easier to maintain. The constructed ribosomal RNA database is suitable for use as a standard database against which researchers compare the unknown sequences they obtain, so the sequence information in the database must be representative and informative. Therefore, excluding sequences with a large number of ambiguous or unknown bases may improve the interpretability of analysis results.
Number | Date | Country | Kind |
---|---|---|---|
111107822 | Mar 2022 | TW | national |