This application claims priority to Chinese Patent Application No. 202210463446.2, filed on Apr. 28, 2022, the contents of which are hereby incorporated by reference.
The application relates to the technical field of gene identification, and in particular to a group of single nucleotide polymorphism (SNP) loci and a method for identifying the biogeographic origins of East Asian populations.
Ancestry informative markers are genetic markers that show large allele frequency differences among populations. They may be used to infer the biogeographic origin of unknown individuals and to identify potential substructure within a population. The former use may provide investigative leads for judicial investigations in forensic medicine; the latter may control for population stratification in genome-wide association studies, so as to avoid false positive or false negative results. At present, forensic scientists usually focus on distinguishing the major intercontinental populations. To date, several panels of ancestry informative markers for forensic ancestry analysis of different intercontinental populations have been reported. However, there is relatively little research on forensic ancestry analysis of populations within the same continent, that is, of subpopulations within a major intercontinental population.
For the analysis of the biogeographic origins of unknown individuals, forensic scientists usually use principal component analysis or population genetic structure analysis. Principal component analysis reduces the dimensionality of the genotype data of all samples across all loci, transforming the variable information into a few important principal components. Each sample has a specific position along these components, and the possible biogeographic origin of an individual is inferred from the distribution of the samples along the components. Population genetic structure analysis estimates the proportions of an individual's ancestry components using a Bayesian method, and then determines the individual's ancestral origin by comparing the distribution of ancestry components with that of reference populations. However, these two methods may not yield accurate predictions for individuals with an admixed history.
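By way of illustration only, the dimension-reduction step of principal component analysis may be sketched as follows. This sketch is not part of the disclosed method. The function name `first_principal_component`, the toy genotype matrix and the use of power iteration are assumptions introduced here, and genotypes are assumed to be coded as counts (0, 1 or 2) of one allele at each locus.

```python
import math


def first_principal_component(genotypes, iterations=200):
    """Return (loadings, scores) for the first principal component.

    genotypes: list of samples, each a list of 0/1/2 allele counts per locus.
    Hypothetical helper for illustration; not taken from the application.
    """
    n, m = len(genotypes), len(genotypes[0])
    # Center each locus (column) on its mean allele count.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]

    # Non-uniform start vector, normalized to unit length.
    v = [float(j + 1) for j in range(m)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    # Power iteration on X^T X, evaluated as X^T (X v).
    for _ in range(iterations):
        xv = [sum(r[j] * v[j] for j in range(m)) for r in centered]
        w = [sum(centered[i][j] * xv[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variation in the data
            break
        v = [x / norm for x in w]

    # Position of each sample along the first component.
    scores = [sum(r[j] * v[j] for j in range(m)) for r in centered]
    return v, scores


# Toy data (invented): three samples from each of two hypothetical groups.
toy_genotypes = [
    [2, 2, 0, 0], [2, 1, 0, 0], [1, 2, 0, 1],
    [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 1, 2],
]
loadings, scores = first_principal_component(toy_genotypes)
```

In this toy example the two groups fall on opposite sides of zero along the first component, which is the kind of separation from which an origin would be inferred. The sign of the component itself is arbitrary.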
A single nucleotide polymorphism (SNP) is a sequence polymorphism formed by the variation of a single nucleotide in the genome. SNPs are widely distributed in the genome and have a low mutation rate, and they have high application value in forensic research. In addition, previous studies have found that some single nucleotide polymorphisms show large differences in allele frequency distribution among different populations, and may be used as ancestry informative markers to analyze the biogeographic origins of different populations.
The objective of the application is to provide a group of single nucleotide polymorphism (SNP) loci and a method for identifying the biogeographic origins of East Asian populations, so as to solve the problems existing in the prior art. These loci may be used to identify the Beijing Han population, the Southern Han population, the Dai population, the Japanese population and the Kinh population from Vietnam.
In order to achieve the above objectives, the application provides the following schemes:
Optionally, the biogeographic origins of the East Asian populations include the Beijing Han population, the Southern Han population, the Dai population, the Japanese population and the Kinh population from Vietnam.
The application also provides a method for analyzing the biogeographic origins of the East Asian populations, including steps of screening the group of whole-genome single nucleotide polymorphism loci for identifying the biogeographic origins of the East Asian populations.
Optionally, the steps are as follows:
Optionally, in the step (1), principles of preliminarily screening relatively highly differentiated single nucleotide polymorphism loci in the East Asian populations include:
Optionally, between the step (1) and the step (2), the method further includes using a principal component analysis method to evaluate the analytic efficiency of the single nucleotide polymorphism loci preliminarily screened in the step (1) for the East Asian populations.
Optionally, the method further includes: constructing a prediction model using the single nucleotide polymorphism loci obtained by re-screening in the step (2), and evaluating its identification efficiency for the biogeographic origins of the East Asian populations.
The application also provides a use of a group of whole-genome single nucleotide polymorphism loci for identifying the biogeographic origins of the East Asian populations in forensic medicine and population genetics research.
The application discloses the following technical effects:
The application provides a group of single nucleotide polymorphism loci with high genetic differentiation among the East Asian populations. Compared with previously reported loci for distinguishing different intercontinental populations, the loci in the application are well suited to analyzing the biogeographic origins of the East Asian populations, and may provide more valuable information for forensic medicine and population genetics research.
The application provides a method for analyzing the biogeographic origins of the East Asian populations based on the single nucleotide polymorphism loci. Compared with the conventional methods of principal component analysis and population genetic structure analysis, the method disclosed in the application is simple, fast, accurate and easy to interpret.
In order to explain the embodiments of the application or the technical schemes in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below show only some embodiments of the application, and those of ordinary skill in the art may obtain other drawings from them without creative effort.
A number of exemplary embodiments of the application are described in detail now, and this detailed description should not be considered as a limitation of the application, but should be understood as a more detailed description of certain aspects, characteristics and embodiments of the application.
It should be understood that the terminology described in the application is only for describing specific embodiments and is not used to limit the application. In addition, for the numerical range in the application, it should be understood that each intermediate value between the upper limit and the lower limit of the range is also specifically disclosed. The intermediate value within any stated value or stated range and every smaller range between any other stated value or intermediate value within the stated range are also included in the application. The upper and lower limits of these smaller ranges may be independently included or excluded from the range.
Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application relates. Although the application only describes the preferred methods and materials, any methods and materials similar or equivalent to those described herein may also be used in the practice or testing of the application. All documents mentioned in this specification are incorporated by reference to disclose and describe methods and/or materials related to the documents. In case of conflict with any incorporated document, the contents of this specification shall prevail.
It is obvious to those skilled in the art that many improvements and changes may be made to the specific embodiments of the application without departing from the scope or spirit of the application. Other embodiments are apparent to the skilled person from the description of the application. The specification and example of this application are only exemplary.
The terms “including”, “comprising”, “having” and “containing” used in the application are all open terms, which means including but not limited to.
Embodiment 1 A Method for Analyzing Biogeographic Origins of East Asian Populations
The software used in the application mainly includes PLINK, YModel and R, which are used for screening single nucleotide polymorphism (SNP) loci for identifying the biogeographic origins of five East Asian populations: the Beijing Han population, the Southern Han population, the Dai population, the Japanese population and the Kinh population from Vietnam.
Firstly, relatively highly differentiated single nucleotide polymorphism loci of the five East Asian populations are preliminarily screened. The whole-genome data of the East Asian populations are downloaded from the international 1000 Genomes Project. Using PLINK, the command ‘plink --bfile all --hwe 0.0001 --maf 0.01 --make-bed --out new’ is run on all East Asian individuals to exclude single nucleotide polymorphism loci whose Hardy-Weinberg equilibrium (HWE) P value is less than 0.0001 or whose minor allele frequency is less than 0.01. Then ‘plink --bfile new --indep-pairwise 50 5 0.6’ is used to keep single nucleotide polymorphism loci with pairwise r2 values less than 0.6. Next, ‘plink --bfile new3 --within pop.txt --fst’ is used to calculate the fixation index (Fst) of each locus in the East Asian populations, and single nucleotide polymorphism loci with a fixation index greater than 0.06 are selected. Loci located in the major histocompatibility complex (MHC) region are eliminated. The remaining loci are then re-screened to select those with high genetic differentiation between paired populations, and the specific principles are as follows:
Finally, the above loci are screened again using ‘plink --bfile all --hwe 0.0001 --maf 0.01 --within pop.txt --make-bed --out new’ and ‘plink --bfile new --within pop.txt --indep-pairwise 50 5 0.6’, eliminating single nucleotide polymorphism loci with an HWE P value less than 0.0001, a minor allele frequency less than 0.01 or a pairwise r2 value greater than 0.6 in each population. In the end, the application retains 677 single nucleotide polymorphism loci. A flow chart of the above-mentioned single nucleotide polymorphism loci screening is shown in the accompanying drawings.
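For illustration only, the frequency-based part of the screening above (minor allele frequency and per-locus fixation index thresholds) may be sketched as follows. The sketch uses a simplified Wright's Fst, (H_T - H_S) / H_T, for a biallelic locus; this is an assumption made for readability, and the estimator used by PLINK differs in detail. The function names, the locus names and the toy genotypes are hypothetical and are not data from the application.

```python
def minor_allele_frequency(genotypes):
    """Minor allele frequency from 0/1/2 allele counts at one locus."""
    p = sum(genotypes) / (2 * len(genotypes))
    return min(p, 1 - p)


def wright_fst(genotypes_by_pop):
    """Simplified Wright's Fst = (H_T - H_S) / H_T for one biallelic locus.

    genotypes_by_pop: dict mapping a population label to a list of
    0/1/2 allele counts. Not the estimator implemented in PLINK.
    """
    total_alleles = sum(2 * len(g) for g in genotypes_by_pop.values())
    p_total = sum(sum(g) for g in genotypes_by_pop.values()) / total_alleles
    h_t = 2 * p_total * (1 - p_total)  # expected heterozygosity, pooled
    if h_t == 0:  # monomorphic locus: no differentiation to measure
        return 0.0
    h_s = 0.0  # sample-size-weighted mean within-population heterozygosity
    for g in genotypes_by_pop.values():
        p = sum(g) / (2 * len(g))
        h_s += (2 * len(g) / total_alleles) * 2 * p * (1 - p)
    return (h_t - h_s) / h_t


def select_loci(loci, fst_min=0.06, maf_min=0.01):
    """Keep loci with pooled MAF >= maf_min and Fst > fst_min."""
    kept = []
    for name, by_pop in loci.items():
        pooled = [g for pop in by_pop.values() for g in pop]
        if minor_allele_frequency(pooled) < maf_min:
            continue
        if wright_fst(by_pop) > fst_min:
            kept.append(name)
    return kept


# Toy genotypes (invented) for two hypothetical populations.
toy_loci = {
    "snp_a": {"pop1": [2, 2, 1], "pop2": [0, 0, 1]},  # differentiated
    "snp_b": {"pop1": [1, 0, 1], "pop2": [1, 1, 0]},  # same frequency
    "snp_c": {"pop1": [0, 0, 0], "pop2": [0, 0, 0]},  # monomorphic
}
selected = select_loci(toy_loci)
```

The default thresholds mirror the values stated in the text (fixation index greater than 0.06, minor allele frequency of at least 0.01); the HWE test and the pairwise r2 pruning are not reproduced in this sketch.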
Next, the principal component analysis method in common use at present is adopted to evaluate the analytic efficiency of the 677 single nucleotide polymorphism loci for the East Asian populations. The specific operations are as follows:
Re-screening of the 677 single nucleotide polymorphism loci: the application adopts the machine learning algorithm XGBoost and re-screens the 677 single nucleotide polymorphism loci by an optimal subset method, finally determining 258 single nucleotide polymorphism loci. Prediction models are constructed using the 677 and the 258 single nucleotide polymorphism loci respectively, and their identification efficiency for the biogeographic origins of the East Asian populations is evaluated. The confusion matrices of the predicted results against the actual sample results are shown in the accompanying drawings.
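For illustration only, a confusion matrix of predicted against actual population labels, of the kind used to evaluate identification efficiency, may be computed as sketched below. The sketch does not train an XGBoost model; it assumes that predicted labels are already available. The function names and the toy labels are hypothetical and are not results from the application. The labels used are the 1000 Genomes Project population codes for the Beijing Han (CHB), Southern Han (CHS), Dai (CDX), Japanese (JPT) and Kinh (KHV) populations.

```python
from collections import Counter

POPULATIONS = ["CHB", "CHS", "CDX", "JPT", "KHV"]


def confusion_matrix(actual, predicted, labels):
    """Rows are actual labels, columns are predicted labels."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly identified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Toy labels (invented for illustration; not results from the application).
actual = ["CHB", "CHB", "CHS", "CDX", "JPT", "JPT", "KHV", "KHV"]
predicted = ["CHB", "CHS", "CHS", "CDX", "JPT", "JPT", "KHV", "CDX"]

matrix = confusion_matrix(actual, predicted, POPULATIONS)
overall = accuracy(matrix)
```

Off-diagonal entries show which pairs of populations a model confuses, so matrices obtained with different locus sets can be compared row by row.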
The above-mentioned embodiments only describe the preferred modes of the application, and do not limit the scope of the application. Without departing from the design spirit of the application, various modifications and improvements made by those of ordinary skill in the art to the technical schemes of the application shall fall within the protection scope determined by the claims of the application.
Number | Date | Country | Kind |
---|---|---|---|
202210463446.2 | Apr 2022 | CN | national |