The present disclosure relates to the field of biotechnology, in particular non-invasive prenatal genetic testing, and specifically to a method and apparatus for determining the pregnancy status of a pregnant woman and a corresponding method and apparatus for constructing a machine learning prediction model.
The cell-free DNA (cfDNA) in the plasma of pregnant women contains fetal cfDNA. This fetal cfDNA is mainly derived from the placenta, and partly derived from hematopoietic stem cells or directly from exchange between the fetus and the mother. Studies have confirmed that the concentration of fetal cfDNA in the plasma of pregnant women is correlated with various pregnancy complications such as premature delivery, intrauterine growth retardation, and pregnancy eclampsia.
Research articles on the correlation between fetal cfDNA concentration in the plasma of pregnant women and premature delivery have appeared steadily in recent years. However, no definite conclusion on this correlation has been reached, and different studies report contradictory conclusions.
Currently, methods for effectively predicting premature delivery based on the fetal cfDNA concentration remain to be developed.
The present disclosure is provided based on the discovery and recognition by the inventors of the following facts and issues:
To date, most clinical predictions of threatened premature delivery are made by detecting fetal fibronectin in the vaginal secretions of pregnant women, but this method is only an auxiliary means and cannot serve as the basis for a final diagnosis. At present, there is no effective clinical method for diagnosing premature delivery.
Several reports have shown that the concentration of fetal cfDNAs in the plasma of pregnant women is correlated with various pregnancy complications, such as premature delivery and preeclampsia. Studies have attempted to predict premature delivery using the fetal cfDNA concentration as a marker, but eventually failed due to insufficient correlation. To date, there is no effective method to predict premature delivery using a fetal cfDNA concentration.
The clinical method of diagnosing premature delivery with the aid of fetal fibronectin suffers from a high false-positive rate. Statistics show that, among pregnant women who test positive for fetal fibronectin, fewer than 3% are finally diagnosed with premature delivery. This high false-positive rate makes the diagnostic method questionable.
A previously reported method that predicts premature delivery from a single factor, the concentration of fetal cfDNA in the plasma of pregnant women, suffers from insufficient correlation and has failed to yield an effective prediction model.
Additional aspects and advantages of the present disclosure will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of the present disclosure.
According to one aspect of the present disclosure, provided is a method for constructing a prediction model for determining a pregnancy status of a pregnant woman. According to embodiments of the present disclosure, the method includes: (i) constructing a training set and an optional validation set, each of the training set and the validation set being composed of a plurality of pregnant woman samples each having a known pregnancy status; (ii) determining predetermined parameters of each pregnant woman sample in the training set, the predetermined parameters including a concentration of fetal cell-free nucleic acids in peripheral blood of the pregnant woman and a gestational age in week at which sampling for the peripheral blood of the pregnant woman is conducted; and (iii) constructing the prediction model based on the known pregnancy status and the predetermined parameters. In the method provided by the embodiments of the present disclosure, the prediction model for the pregnancy status of the pregnant woman is constructed from the following data for a plurality of pregnant woman samples: the concentration of fetal cell-free nucleic acids obtained via one-time blood sampling, the gestational age in week at which the sampling is conducted, the physical signs (such as height, body weight, body mass index, and age) of the pregnant woman when the sampling is conducted, and the known pregnancy status (such as premature delivery and gestational age in week at delivery) of the pregnant woman. Because the method includes two key factors, the concentration of fetal cell-free nucleic acids and the gestational age in week at which the sampling is conducted, the accuracy of the model is improved.
According to embodiments of the present disclosure, the above-mentioned method may further have at least one of the following additional technical features:
According to embodiments of the present disclosure, the pregnancy status includes a delivery interval of the pregnant woman. The method according to the embodiments of the present disclosure can be used to predict the probability of premature delivery, intrauterine growth retardation of a fetus at the gestational age in week at delivery, and other pregnancy complications associated with the concentration of fetal cell-free nucleic acids.
According to embodiments of the present disclosure, the gestational age in week at which the sampling is conducted is 13 to 25 weeks. The inventors found that there was a weak correlation between fetal concentration and premature delivery when the gestational age in week at which the blood sampling is conducted is 12 weeks or less or between 26 weeks and 30 weeks, while there was a strong correlation between fetal concentration and premature delivery when the gestational age in week at which the blood sampling is conducted is 13 to 25 weeks.
According to embodiments of the present disclosure, the prediction model is at least one of a linear regression model, a logistic regression model, or a random forest. According to the method of embodiments of the present disclosure, the prediction model may be theoretically any statistical model that generalizes different difference distributions.
According to embodiments of the present disclosure, the predetermined parameters further include a height, a body weight, and an age of the pregnant woman.
According to embodiments of the present disclosure, the step (iii) includes determining, by using the training set and the validation set, numerical values of $\beta_0$, $\beta_{i,\mathrm{cff}}$, $\beta_{i,\mathrm{sample}}$, $\beta_{i,\mathrm{height}}$, $\beta_{i,\mathrm{weight}}$, $\beta_{i,\mathrm{age}}$, and $\varepsilon_i$ for the following formula:
$$l_i = \beta_0 + \beta_{i,\mathrm{cff}} x_{i,\mathrm{cff}} + \beta_{i,\mathrm{sample}} x_{i,\mathrm{sample}} + \beta_{i,\mathrm{height}} x_{i,\mathrm{height}} + \beta_{i,\mathrm{weight}} x_{i,\mathrm{weight}} + \beta_{i,\mathrm{age}} x_{i,\mathrm{age}} + \varepsilon_i, \quad i = 1, \ldots, p,$$
wherein $i$ represents a serial number of a pregnant woman sample in the training set; $l_i$ is a value determined for the known pregnancy status of the pregnant woman sample No. $i$, wherein $l_i$ is 1 for the pregnant woman sample with premature delivery and $l_i$ is 0 for the pregnant woman sample with full-term delivery; $x_{i,\mathrm{cff}}$ represents the concentration of fetal cell-free nucleic acids of the pregnant woman sample No. $i$; $x_{i,\mathrm{sample}}$ represents the gestational age in week at which the sampling for the peripheral blood of the pregnant woman sample No. $i$ is conducted; $x_{i,\mathrm{height}}$ represents the height of the pregnant woman sample No. $i$; $x_{i,\mathrm{weight}}$ represents the body weight of the pregnant woman sample No. $i$; $x_{i,\mathrm{age}}$ represents the age of the pregnant woman sample No. $i$; and $\varepsilon_i$ represents a sequencing error of the peripheral blood of the pregnant woman sample No. $i$.
In a second aspect of the present disclosure, provided is a system for constructing a prediction model for determining a pregnancy status of a pregnant woman. According to embodiments of the present disclosure, the system includes: a training set construction module configured to construct a training set composed of a plurality of pregnant woman samples each having a known pregnancy status; a predetermined parameter determination module connected to the training set construction module and configured to determine predetermined parameters of each pregnant woman sample in the training set, the predetermined parameters including a concentration of fetal cell-free nucleic acids in peripheral blood of the pregnant woman and a gestational age in week at which sampling for the peripheral blood of the pregnant woman is conducted; and a prediction model construction module connected to the predetermined parameter determination module and configured to construct the prediction model based on the known pregnancy status and the predetermined parameters. According to the embodiments of the present disclosure, the system constructs the prediction model for the pregnancy status of the pregnant woman from the following data for a plurality of pregnant woman samples: the concentration of fetal cell-free DNA obtained via one-time blood sampling, the gestational age in week at which the sampling is conducted, the physical signs (such as height, body weight, body mass index, and age) of the pregnant woman when the sampling is conducted, and the known pregnancy status (such as premature delivery and gestational age in week at delivery) of the pregnant woman. Because the system uses two key factors, the concentration of fetal cell-free DNA and the gestational age in week at which the sampling is conducted, as the key parameters for constructing the model, the accuracy of the constructed model is improved.
According to an embodiment of the present disclosure, the above-mentioned system may further have at least one of the following additional technical features:
According to an embodiment of the present disclosure, the pregnancy status includes a delivery interval of the pregnant woman. The system according to the embodiments of the present disclosure can be used to predict the probability of premature delivery, intrauterine growth retardation of a fetus at the gestational age in week at delivery, and other pregnancy complications associated with the concentration of fetal cell-free nucleic acids.
According to embodiments of the present disclosure, the gestational age in week at which sampling is conducted is 13 to 25 weeks. The inventors found that there was a weak correlation between fetal concentration and premature delivery when the gestational age in week at which the blood sampling is conducted is 12 weeks or less or between 26 weeks and 30 weeks, while there was a strong correlation between fetal concentration and premature delivery when the gestational age in week at which the blood sampling is conducted is 13 to 25 weeks.
According to embodiments of the present disclosure, the prediction model may be theoretically any statistical model that generalizes different difference distributions. According to a specific embodiment of the present disclosure, the prediction model is at least one of a linear regression model, a logistic regression model, or a random forest.
According to embodiments of the present disclosure, the predetermined parameters further include a height, a body weight, and an age of the pregnant woman.
According to embodiments of the present disclosure, the prediction model construction module is configured to determine, by using the training set and a validation set, numerical values of $\beta_0$, $\beta_{i,\mathrm{cff}}$, $\beta_{i,\mathrm{sample}}$, $\beta_{i,\mathrm{height}}$, $\beta_{i,\mathrm{weight}}$, $\beta_{i,\mathrm{age}}$, and $\varepsilon_i$ for the following formula:
$$l_i = \beta_0 + \beta_{i,\mathrm{cff}} x_{i,\mathrm{cff}} + \beta_{i,\mathrm{sample}} x_{i,\mathrm{sample}} + \beta_{i,\mathrm{height}} x_{i,\mathrm{height}} + \beta_{i,\mathrm{weight}} x_{i,\mathrm{weight}} + \beta_{i,\mathrm{age}} x_{i,\mathrm{age}} + \varepsilon_i, \quad i = 1, \ldots, p,$$
wherein $i$ represents a serial number of the pregnant woman sample in the training set; $l_i$ is a value determined for the known pregnancy status of the pregnant woman sample No. $i$, wherein $l_i$ is 1 for the pregnant woman sample with premature delivery and $l_i$ is 0 for the pregnant woman sample with full-term delivery; $x_{i,\mathrm{cff}}$ represents the concentration of fetal cell-free nucleic acids of the pregnant woman sample No. $i$; $x_{i,\mathrm{sample}}$ represents the gestational age in week at which the sampling for the peripheral blood of the pregnant woman sample No. $i$ is conducted; $x_{i,\mathrm{height}}$ represents the height of the pregnant woman sample No. $i$; $x_{i,\mathrm{weight}}$ represents the body weight of the pregnant woman sample No. $i$; $x_{i,\mathrm{age}}$ represents the age of the pregnant woman sample No. $i$; and $\varepsilon_i$ represents a sequencing error of the peripheral blood of the pregnant woman sample No. $i$.
In a third aspect of the present disclosure, provided is a method for determining a pregnancy status of a pregnant woman. According to embodiments of the present disclosure, the method includes: (1) determining predetermined parameters of the pregnant woman, the predetermined parameters including a concentration of fetal cell-free nucleic acids in peripheral blood of the pregnant woman and a gestational age in week at which sampling for the peripheral blood of the pregnant woman is conducted; and (2) determining the pregnancy status of the pregnant woman based on the predetermined parameters and the prediction model constructed according to the method for constructing the prediction model. The method according to the embodiments of the present disclosure can quickly and accurately predict the pregnancy status of the pregnant woman based on information about the concentration of fetal cell-free nucleic acids in the peripheral blood of the pregnant woman obtained via one-time blood sampling at early pregnancy, the gestational age in week at which the sampling for the peripheral blood is conducted, and the physical sign data of the pregnant woman, the pregnancy status including the gestational age in week at delivery, the probability of premature delivery, the intrauterine growth retardation of the fetus, and other pregnancy complications associated with the concentration of fetal cell-free nucleic acids.
According to an embodiment of the present disclosure, the above-mentioned method may further have at least one of the following additional technical features:
According to embodiments of the present disclosure, the pregnancy status includes a delivery interval of the pregnant woman. The delivery interval refers to the gestational age in week at delivery. The method according to the embodiments of the present disclosure can effectively predict the gestational age in week at delivery and the probability of premature delivery of a pregnant woman. In addition, the method according to the embodiments of the present disclosure can also effectively predict pregnancy complications associated with the concentration of fetal cell-free nucleic acids, such as the probability of premature delivery and intrauterine growth retardation of a fetus at the gestational age in week at delivery.
According to embodiments of the present disclosure, the gestational age in week at which the sampling is conducted is 13 to 25 weeks. The inventors found that there was a weak correlation between fetal concentration and premature delivery when the gestational age in week at which the blood sampling is conducted is 12 weeks or less or between 26 weeks and 30 weeks, while there was a strong correlation between fetal concentration and premature delivery when the gestational age in week at which the blood sampling is conducted is 13 to 25 weeks.
According to embodiments of the present disclosure, the prediction model may be theoretically any statistical model that generalizes different difference distributions. According to a specific embodiment of the present disclosure, the predetermined prediction model is at least one of a linear regression model, a logistic regression model, or a random forest.
According to embodiments of the present disclosure, the predetermined parameters further include a height, a body weight, and/or an age of the pregnant woman, and the prediction model is adapted to calculate the delivery interval of the pregnant woman based on the following formula:
$$l = \beta_0 + \beta_{\mathrm{cff}} x_{\mathrm{cff}} + \beta_{\mathrm{sample}} x_{\mathrm{sample}} + \beta_{\mathrm{height}} x_{\mathrm{height}} + \beta_{\mathrm{weight}} x_{\mathrm{weight}} + \beta_{\mathrm{age}} x_{\mathrm{age}} + \varepsilon,$$
wherein $l$ is a parameter determined based on a probability of premature delivery of the pregnant woman; $\beta_0$, $\beta_{\mathrm{cff}}$, $\beta_{\mathrm{sample}}$, $\beta_{\mathrm{height}}$, $\beta_{\mathrm{weight}}$, $\beta_{\mathrm{age}}$, and $\varepsilon$ are each independently a predetermined coefficient; $x_{\mathrm{cff}}$ is the concentration of fetal cell-free nucleic acids of the pregnant woman; $x_{\mathrm{sample}}$ is the gestational age in week at which the sampling for the maternal peripheral blood of the pregnant woman is conducted; $x_{\mathrm{height}}$ is the height of the pregnant woman; $x_{\mathrm{weight}}$ is the body weight of the pregnant woman; $x_{\mathrm{age}}$ is the age of the pregnant woman; and $\varepsilon$ is a sequencing error of a peripheral blood sample of the pregnant woman. According to the embodiments of the present disclosure, the coefficients $\beta_0$, $\beta_{\mathrm{cff}}$, $\beta_{\mathrm{sample}}$, $\beta_{\mathrm{height}}$, and $\beta_{\mathrm{weight}}$ may be obtained based on a predetermined training set, one or several of which may be selected, and the body mass index (BMI) of the pregnant woman may be added as an additional term.
According to embodiments of the present disclosure, $l$ is determined based on the following formula:
$$l = \log_b \frac{p}{1 - p},$$
where $b$ is the base of the logarithm and is generally the constant $e$, and $p$ is the probability of premature delivery of the pregnant woman.
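The relationship between $l$ and $p$ can be illustrated with a minimal Python sketch. It assumes that $l$ is the log-odds of premature delivery, $l = \log_b(p/(1-p))$, and that the error term is omitted when a prediction is made. The function names and all coefficient values below are hypothetical and serve only to show the arithmetic; they are not values from the disclosure.

```python
import math


def logit(p, base=math.e):
    """Return l = log_b(p / (1 - p)) for a probability p strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p), base)


def probability_from_l(l, base=math.e):
    """Invert the logit: p = 1 / (1 + b ** (-l)), computed in a numerically stable way."""
    z = l * math.log(base)  # convert to natural-log units
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear_predictor(coefficients, parameters):
    """Compute l = beta_0 + sum(beta_k * x_k); the error term is assumed to be zero."""
    return coefficients["intercept"] + sum(
        coefficients[name] * value for name, value in parameters.items()
    )


# Hypothetical coefficients, NOT taken from the disclosure.
coefficients = {"intercept": -2.0, "cff": 8.0, "sample": 0.02,
                "height": -0.005, "weight": 0.01, "age": 0.03}
# Hypothetical parameters of one pregnant woman (cff as a fraction, weeks, cm, kg, years).
parameters = {"cff": 0.12, "sample": 18, "height": 160, "weight": 60, "age": 30}

l_value = linear_predictor(coefficients, parameters)
p_preterm = probability_from_l(l_value)
```

With the base $b = e$, `probability_from_l` is the usual logistic function, so a larger $l$ always corresponds to a higher probability of premature delivery.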
In a fourth aspect of the present disclosure, provided is an apparatus for determining a pregnancy status of a pregnant woman. According to embodiments of the present disclosure, the apparatus includes: a parameter determination module configured to determine predetermined parameters of the pregnant woman, the predetermined parameters including a concentration of fetal cell-free nucleic acids in peripheral blood of the pregnant woman and a gestational age in week at which sampling for the peripheral blood of the pregnant woman is conducted; and a pregnancy status determination module connected to the parameter determination module and configured to determine the pregnancy status of the pregnant woman based on the predetermined parameters and the prediction model. The apparatus according to the embodiments of the present disclosure can quickly and accurately predict the pregnancy status of the pregnant woman based on the information about the concentration of fetal cell-free nucleic acids obtained via one-time blood sampling at early pregnancy of the pregnant woman, the gestational age in week at which the sampling for the peripheral blood is conducted, and the physical sign data of the pregnant woman, the pregnancy status including the gestational age in week at delivery, the probability of premature delivery, the intrauterine growth retardation of the fetus and other pregnancy complications associated with the concentration of fetal cell-free nucleic acids.
According to embodiments of the present disclosure, the above-mentioned apparatus may further have the following additional technical features:
According to embodiments of the present disclosure, the pregnancy status includes a delivery interval of the pregnant woman. The apparatus according to the embodiments of the present disclosure can predict the probability of premature delivery, intrauterine growth retardation of the fetus at the gestational age in week at delivery, and other pregnancy complications associated with the concentration of fetal cell-free nucleic acids.
According to embodiments of the present disclosure, the gestational age in week at which the sampling is conducted is 13 to 25 weeks. The inventors found that there was a weak correlation between fetal concentration and premature delivery when the gestational age in week at which the blood sampling is conducted is 12 weeks or less or between 26 weeks and 30 weeks, while there was a strong correlation between fetal concentration and premature delivery when the gestational age in week at which the blood sampling is conducted is 13 to 25 weeks.
According to embodiments of the present disclosure, the predetermined prediction model is at least one of a linear regression model, a logistic regression model, or a random forest. According to a specific embodiment of the present disclosure, the prediction model may be theoretically any statistical model that generalizes different difference distributions.
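According to embodiments of the present disclosure, the predetermined parameters further include a height, a body weight, and an age of the pregnant woman, and the prediction model is adapted to calculate a delivery interval of the pregnant woman based on the following formula:
$$l = \beta_0 + \beta_{\mathrm{cff}} x_{\mathrm{cff}} + \beta_{\mathrm{sample}} x_{\mathrm{sample}} + \beta_{\mathrm{height}} x_{\mathrm{height}} + \beta_{\mathrm{weight}} x_{\mathrm{weight}} + \beta_{\mathrm{age}} x_{\mathrm{age}} + \varepsilon,$$
wherein $l$ is a parameter determined based on the probability of premature delivery of the pregnant woman; $\beta_0$, $\beta_{\mathrm{cff}}$, $\beta_{\mathrm{sample}}$, $\beta_{\mathrm{height}}$, $\beta_{\mathrm{weight}}$, $\beta_{\mathrm{age}}$, and $\varepsilon$ are each independently a predetermined coefficient; $x_{\mathrm{cff}}$ is the concentration of fetal cell-free nucleic acids of the pregnant woman; $x_{\mathrm{sample}}$ is the gestational age in week at which the sampling for the peripheral blood of the pregnant woman is conducted; $x_{\mathrm{height}}$ is the height of the pregnant woman; $x_{\mathrm{weight}}$ is the body weight of the pregnant woman; $x_{\mathrm{age}}$ is the age of the pregnant woman; and $\varepsilon$ is a sequencing error of a peripheral blood sample of the pregnant woman. According to embodiments of the present disclosure, the coefficients $\beta_0$, $\beta_{\mathrm{cff}}$, $\beta_{\mathrm{sample}}$, $\beta_{\mathrm{height}}$, and $\beta_{\mathrm{weight}}$ may be freely selected as needed; for example, the BMI of the pregnant woman may be added as an additional term.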
According to embodiments of the present disclosure, $l$ is determined based on the following formula:
$$l = \log_b \frac{p}{1 - p},$$
wherein $b$ is the base of the logarithm and is generally the constant $e$, and $p$ is the probability of premature delivery of the pregnant woman.
In a fifth aspect of the present disclosure, provided is a computer-readable storage medium having a computer program stored thereon. The program, when executed by a processor, implements the steps of the above-described method for constructing the prediction model. Thus, the above-described method for constructing the prediction model can be effectively implemented, so that the prediction model can be effectively constructed, and the prediction model can be then used to perform prediction on an unknown sample to determine the pregnancy status of the pregnant woman to be detected.
In a sixth aspect of the present disclosure, provided is an electronic device including a computer-readable storage medium as described above; and one or more processors configured to execute the program in the computer-readable storage medium.
The foregoing and/or additional aspects and advantages of the present disclosure will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
Embodiments of the present disclosure will be described in detail below, examples of which are illustrated in the accompanying drawings. The examples described below with reference to the accompanying drawings are illustrative, which are merely intended to explain the present disclosure, rather than to limit the present disclosure.
As used herein, the terms “first”, “second”, “third”, and other similar terms, unless specifically stated otherwise, are used for descriptive purposes to distinguish one from another and are not intended to imply or express any differences in order or importance, and it is not intended to mean that a content defined by terms such as “first”, “second”, “third” and the like consists of only one element.
In the present disclosure, unless otherwise clearly specified and limited, the terms “installation”, “interconnection”, “connection” and “fixation” etc. are intended to be understood in a broad sense, for example, it may be a fixed connection, removable connection or integral connection; may be a mechanical connection or electrical connection; may be a direct connection or indirect connection using an intermediate; and may be a communication within two elements or an interaction relationship between the two elements, unless explicitly limited otherwise. A person of ordinary skill in the art can understand specific meanings of these terms in the present disclosure based on specific situations.
According to one aspect of the present disclosure, a method for constructing a prediction model is provided. According to an embodiment of the present disclosure, referring to the accompanying drawings, the method includes:
S1000, constructing a training set and an optional validation set, each of the training set and the validation set being composed of a plurality of pregnant woman samples each having a known pregnancy status;
S2000, determining predetermined parameters of each pregnant woman sample in the training set, the predetermined parameters including a concentration of fetal cell-free nucleic acids in peripheral blood of the pregnant woman and a gestational age in week at which sampling for the peripheral blood of the pregnant woman is conducted; and
S3000, constructing the prediction model based on the known pregnancy status and the predetermined parameters.
The method according to the embodiment of the present disclosure constructs the prediction model for the pregnancy status of the pregnant woman from the following data for a plurality of pregnant woman samples: the concentration of fetal cell-free nucleic acids obtained via one-time blood sampling, the gestational age in week at which the sampling is conducted, the physical signs (such as height, body weight, BMI, and age) of the pregnant woman when the sampling is conducted, and the known pregnancy status (such as premature delivery and gestational age in week at delivery) of the pregnant woman. Because the method includes two key factors, the concentration of fetal cell-free nucleic acids and the gestational age in week at which the sampling is conducted, the accuracy of the model is improved.
According to an embodiment of the present disclosure, the concentration of fetal cell-free nucleic acids is obtained by data processing that takes sequencing data of the cell-free nucleic acids in the plasma of a pregnant woman as input data, and specifically includes: after quality control of the raw sequencing data (FASTQ format) is finished, aligning the sequencing data to the human reference chromosomes by using alignment software (such as the samse mode in BWA); using sequencing data quality control software (such as Picard) to remove duplicate reads from the alignment results and calculate the duplication rate; completing the local correction of the alignment results by using a mutation detection algorithm (such as the Base Quality Score Recalibration (BQSR) function in GATK); and calculating the average depth of the different chromosomes in each sample by using coverage depth calculation software (such as the DepthOfCoverage function in GATK).
For male fetus samples, the mean depth of coverage of the uniquely aligned reads matching the non-homologous region of the Y chromosome is calculated, and the ratio of this mean depth to the mean depth of the uniquely aligned reads matching the autosomes is the concentration of fetal cell-free nucleic acids. For female fetus samples, the concentration can be calculated using existing methods for calculating the concentration of fetal cell-free nucleic acids based on low-depth sequencing data of maternal plasma.
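The ratio described for male fetus samples can be sketched in a few lines of Python. The sketch assumes that the depths of the uniquely aligned reads have already been computed upstream (for example by a coverage tool) and are available as plain lists of numbers; the function name and the depth values are hypothetical.

```python
from statistics import mean


def fetal_concentration_male(y_region_depths, autosome_depths):
    """Ratio of the mean depth on the non-homologous region of the Y chromosome
    to the mean depth on the autosomes, as described for male fetus samples."""
    autosome_mean = mean(autosome_depths)
    if autosome_mean <= 0:
        raise ValueError("mean autosomal depth must be positive")
    return mean(y_region_depths) / autosome_mean


# Hypothetical depths for illustration only (low-depth sequencing).
y_depths = [0.004, 0.006]            # windows in the non-homologous Y region
autosome_depths = [0.1, 0.1, 0.1]    # per-autosome mean depths
concentration = fetal_concentration_male(y_depths, autosome_depths)
```

The guard on the autosomal mean avoids a division by zero when a sample has no usable autosomal coverage.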
According to a specific embodiment of the present disclosure, in the method of the present disclosure, pregnant woman samples are selected as a training set and a validation set, a prediction model is constructed based on the known pregnancy status, concentration of fetal cell-free nucleic acids, height, body weight, age, BMI, and gestational age in week at which blood sampling is conducted (13 to 25 weeks) in the training set, and the magnitude of each fixed coefficient in the prediction model formula is then determined, so as to predict the pregnancy status of the pregnant woman to be detected.
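Dividing the pregnant woman samples into a training set and a validation set can be sketched as follows. The disclosure does not specify how the samples are divided, so the random shuffle, the 25% validation fraction, and the function name `split_samples` are assumptions made only for illustration.

```python
import random


def split_samples(samples, validation_fraction=0.25, seed=0):
    """Shuffle the samples reproducibly and return (training_set, validation_set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical sample identifiers standing in for pregnant woman samples.
sample_ids = list(range(20))
training_set, validation_set = split_samples(sample_ids)
```

Using a fixed seed keeps the division reproducible, so that coefficients determined on the training set can be checked against the same validation samples each time.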
According to an embodiment of the present disclosure, the pregnancy status includes a delivery interval of the pregnant woman. The method according to the embodiment of the present disclosure can be used to predict the probability of premature delivery, intrauterine growth retardation of a fetus at the gestational age in week at delivery, and other pregnancy complications associated with the concentration of fetal cell-free nucleic acids.
According to an embodiment of the present disclosure, the gestational age in week at which the sampling is conducted is 13 to 25 weeks. The inventors found that there was a weak correlation between fetal cell-free nucleic acid concentration and premature delivery when the gestational age in week at which the blood sampling is conducted is 12 weeks or less or between 26 weeks and 30 weeks, while there was a strong correlation between fetal concentration and premature delivery when the gestational age in week at which the blood sampling is conducted is 13 to 25 weeks. Generally, there is a problem of weak correlation in the prediction of the pregnancy status of pregnant women using the concentration of fetal cell-free nucleic acids. According to the method of the embodiment of the present disclosure, the gestational age in week at which sampling is conducted is added as one of the parameters for constructing the prediction model, which improves the accuracy of prediction. Different pregnant woman samples can be used as model construction samples only with one-time blood sampling within a gestational age of 13 to 25 weeks, avoiding the risk and cost of repeated blood samplings for pregnant woman samples in the process of sample collection.
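Restricting the model construction samples to those drawn at a gestational age of 13 to 25 weeks amounts to a simple filter, sketched below in Python. The `Sample` record, its field names, and the example values are hypothetical; the bounds are treated as inclusive, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    sampling_week: float        # gestational age in week at blood sampling
    fetal_concentration: float  # concentration of fetal cell-free nucleic acids
    preterm: bool               # known pregnancy status


def select_samples(samples, low=13, high=25):
    """Keep samples whose blood was drawn at a gestational age within [low, high] weeks."""
    return [s for s in samples if low <= s.sampling_week <= high]


# Hypothetical cohort for illustration only.
cohort = [
    Sample("A", 12, 0.08, False),
    Sample("B", 13, 0.10, True),
    Sample("C", 25, 0.12, False),
    Sample("D", 27, 0.15, False),
]
selected_ids = [s.sample_id for s in select_samples(cohort)]
```

Each pregnant woman contributes a single record, reflecting the one-time blood sampling described above.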
According to an embodiment of the present disclosure, the prediction model is at least one of a linear regression model, a logistic regression model, or a random forest. According to an embodiment of the present disclosure, the prediction model may be theoretically any statistical model that generalizes different difference distributions.
According to an embodiment of the present disclosure, the predetermined parameters further include a height, a body weight, and an age of the pregnant woman.
According to an embodiment of the present disclosure, the step S3000 includes determining, by using the training set and the validation set, numerical values of $\beta_0$, $\beta_{i,\mathrm{cff}}$, $\beta_{i,\mathrm{sample}}$, $\beta_{i,\mathrm{height}}$, $\beta_{i,\mathrm{weight}}$, $\beta_{i,\mathrm{age}}$, and $\varepsilon_i$ for the following formula:
$$l_i = \beta_0 + \beta_{i,\mathrm{cff}} x_{i,\mathrm{cff}} + \beta_{i,\mathrm{sample}} x_{i,\mathrm{sample}} + \beta_{i,\mathrm{height}} x_{i,\mathrm{height}} + \beta_{i,\mathrm{weight}} x_{i,\mathrm{weight}} + \beta_{i,\mathrm{age}} x_{i,\mathrm{age}} + \varepsilon_i, \quad i = 1, \ldots, p,$$
wherein $i$ represents a serial number of the pregnant woman sample in the training set; $l_i$ is a value determined for the known pregnancy status of the pregnant woman sample No. $i$, wherein $l_i$ is 1 for the pregnant woman sample with premature delivery and $l_i$ is 0 for the pregnant woman sample with full-term delivery; $x_{i,\mathrm{cff}}$ represents the concentration of fetal cell-free nucleic acids of the pregnant woman sample No. $i$; $x_{i,\mathrm{sample}}$ represents the gestational age in week at which the sampling for the peripheral blood of the pregnant woman sample No. $i$ is conducted; $x_{i,\mathrm{height}}$ represents the height of the pregnant woman sample No. $i$; $x_{i,\mathrm{weight}}$ represents the body weight of the pregnant woman sample No. $i$; $x_{i,\mathrm{age}}$ represents the age of the pregnant woman sample No. $i$; and $\varepsilon_i$ represents a sequencing error of the peripheral blood of the pregnant woman sample No. $i$. It should be noted that $\varepsilon$ is the random error generated by the sequencer during the sequencing process; this value is associated with the sequencing batch but independent of the pregnant woman sample, and is generated directly by the sequencer when the sequencing data are downloaded from the sequencer.
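Determining the coefficients from a training set can be sketched with a small logistic regression, one of the model types named in the disclosure. This is a minimal illustration under stated assumptions, not the inventors' implementation: it uses plain gradient descent, assumes one coefficient per parameter shared by all samples, omits the sequencing error term, and standardizes the parameters before fitting. The function names and the toy data are invented for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def standardize(rows):
    """Scale each column to zero mean and unit variance (assumption: helps convergence)."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(k)]
    return [[(r[j] - means[j]) / stds[j] for j in range(k)] for r in rows]


def predict_probability(beta0, betas, x):
    """Probability of premature delivery for one standardized parameter vector."""
    return sigmoid(beta0 + sum(b * v for b, v in zip(betas, x)))


def log_loss(beta0, betas, rows, labels):
    """Mean negative log-likelihood; lower means a better fit."""
    eps, total = 1e-12, 0.0
    for x, y in zip(rows, labels):
        p = min(max(predict_probability(beta0, betas, x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(rows)


def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit beta_0 and one coefficient per parameter by batch gradient descent."""
    n, k = len(rows), len(rows[0])
    beta0, betas = 0.0, [0.0] * k
    for _ in range(epochs):
        grad0, grads = 0.0, [0.0] * k
        for x, y in zip(rows, labels):
            error = predict_probability(beta0, betas, x) - y
            grad0 += error
            for j in range(k):
                grads[j] += error * x[j]
        beta0 -= learning_rate * grad0 / n
        betas = [b - learning_rate * g / n for b, g in zip(betas, grads)]
    return beta0, betas


# Invented toy data: [cff, sampling week, height (cm), body weight (kg), age (years)].
rows = [[0.08, 14, 160, 55, 28], [0.15, 16, 158, 62, 31],
        [0.10, 18, 165, 58, 26], [0.18, 20, 162, 70, 35],
        [0.09, 15, 170, 60, 29], [0.16, 22, 155, 66, 33],
        [0.11, 24, 163, 57, 27], [0.20, 17, 159, 72, 36]]
labels = [0, 1, 0, 1, 0, 1, 0, 1]  # 1 = premature delivery, 0 = full-term delivery

scaled = standardize(rows)
initial_loss = log_loss(0.0, [0.0] * 5, scaled, labels)
beta0, betas = fit_logistic(scaled, labels)
final_loss = log_loss(beta0, betas, scaled, labels)
fitted = [predict_probability(beta0, betas, x) for x in scaled]
```

In practice the fitted coefficients would be checked on the validation set before being used for prediction; a library implementation of logistic regression, linear regression, or a random forest could be substituted for the hand-written fit.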
According to a second aspect of the present disclosure, a system for constructing a prediction model is provided. According to an embodiment of the present disclosure, the prediction model is used to determine a pregnancy status of a pregnant woman, and with reference to the accompanying drawings, the system includes a training set construction module, a predetermined parameter determination module connected to the training set construction module, and a prediction model construction module connected to the predetermined parameter determination module.
According to an embodiment of the present disclosure, the pregnancy status includes a delivery interval of the pregnant woman. The system according to the embodiment of the present disclosure can be used to predict the probability of premature delivery, intrauterine growth retardation of a fetus at the gestational age in week at delivery, and other pregnancy complications associated with the concentration of fetal cell-free nucleic acids.
According to an embodiment of the present disclosure, the gestational age in week at which the sampling is conducted is 13 to 25 weeks. The inventors found that there was a weak correlation between fetal concentration and premature delivery when the gestational age in week at which the blood sampling is conducted is 12 weeks or less or between 26 weeks and 30 weeks, while there was a strong correlation between fetal concentration and premature delivery when the gestational age in week at which the blood sampling is conducted is 13 to 25 weeks. Generally, there is a problem of weak correlation in the prediction of the pregnancy status of pregnant women using the concentration of fetal cell-free nucleic acids. According to the system of the embodiment of the present disclosure, the gestational age in week at which the sampling is conducted is added as one of the parameters for constructing the prediction model, which improves the accuracy of prediction. Different pregnant woman samples can be used as model construction samples only with one-time blood sampling within the gestational age of 13 to 25 weeks, avoiding the risk and cost of repeated blood samplings for pregnant woman samples in the process of sample collection.
According to an embodiment of the present disclosure, the prediction model is at least one of a linear regression model, a logistic regression model, or a random forest. In the system according to an embodiment of the present disclosure, the prediction model may be theoretically any statistical model that generalizes different difference distributions.
According to an embodiment of the present disclosure, the predetermined parameters further include a height, a body weight, and an age of the pregnant woman.
According to an embodiment of the present disclosure, the prediction model construction module is configured to determine, by using the training set and a validation set, numerical values of
for the following formula:
wherein i represents a serial number of the pregnant woman sample in the training set; l_i is a value determined for the known pregnancy status of the pregnant woman sample No. i, wherein l_i is 1 for the pregnant woman sample with premature delivery and l_i is 0 for the pregnant woman sample with full-term delivery; x_i^cff represents the concentration of fetal cell-free nucleic acids of the pregnant woman sample No. i; x_i^sample represents the gestational age in week at which the sampling for the peripheral blood of the pregnant woman sample No. i is conducted; x_i^height represents the height of the pregnant woman sample No. i; x_i^weight represents the body weight of the pregnant woman sample No. i; x_i^age represents the age of the pregnant woman sample No. i; and ε_i represents a sequencing error of the peripheral blood of the pregnant woman sample No. i.
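The formula itself is not reproduced in this text. Judging from the variable definitions above, and assuming the coefficients follow the β0, βcff naming that appears elsewhere in this disclosure, it presumably has the linear form sketched below. The coefficient names β_sample, β_height, β_weight, and β_age and the superscript placement are assumptions, not taken from the source.

```latex
l_i = \beta_0
    + \beta_{cff}\, x_i^{cff}
    + \beta_{sample}\, x_i^{sample}
    + \beta_{height}\, x_i^{height}
    + \beta_{weight}\, x_i^{weight}
    + \beta_{age}\, x_i^{age}
    + \varepsilon_i ,
\qquad i = 1, \dots, p
```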
In a third aspect, the present disclosure provides a method for determining a pregnancy status of a pregnant woman. According to an embodiment of the present disclosure, referring to
S100, determining predetermined parameters of the pregnant woman, the predetermined parameters including a concentration of fetal cell-free nucleic acids in peripheral blood of the pregnant woman and a gestational age in week at which sampling for the peripheral blood of the pregnant woman is conducted; and
S200, determining the pregnancy status of the pregnant woman based on the predetermined parameters and the prediction model.
According to the method of an embodiment of the present disclosure, the concentration of fetal cell-free nucleic acids is obtained by data processing using sequencing data of the cell-free nucleic acids in the plasma of the pregnant woman as input data, specifically including: after the quality control of raw sequencing data (fq format) is finished, aligning the sequencing data to the human reference chromosomes by using alignment software (such as the samse mode of BWA); using sequencing data quality control software (such as Picard) to remove duplicate reads from the alignment results and calculate the duplication rate; performing local correction of the alignment results by using a mutation detection algorithm (such as the Base Quality Score Recalibration (BQSR) function in GATK); and calculating the average depth of the different chromosomes in each sample by using coverage depth calculation software (such as the DepthOfCoverage function in GATK).
For male fetus samples, the mean depth of coverage of the uniquely aligned reads matching the non-homologous region of the Y chromosome is calculated, and the ratio of this mean depth to the mean depth of the uniquely aligned reads matching the autosomes is taken as the concentration of fetal cell-free nucleic acids. For female fetus samples, the calculation can be performed using existing methods for calculating the concentration of fetal cell-free nucleic acids based on low-depth sequencing data of maternal plasma.
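The final ratio step for male fetus samples can be sketched in Python as follows. This is a minimal illustration that assumes the per-chromosome mean depths have already been produced by the upstream tools. The dictionary layout, the key `chrY_nonhomologous`, and both function names are hypothetical, and averaging the autosomes without weighting by chromosome length is a simplification.

```python
def mean_autosomal_depth(depth_by_chrom):
    """Average the per-chromosome mean depths over autosomes chr1..chr22."""
    autosomes = [f"chr{n}" for n in range(1, 23)]
    values = [depth_by_chrom[c] for c in autosomes if c in depth_by_chrom]
    if not values:
        raise ValueError("no autosomal depths supplied")
    return sum(values) / len(values)


def fetal_fraction_male(depth_by_chrom, y_key="chrY_nonhomologous"):
    """Ratio of the Y non-homologous mean depth to the autosomal mean depth."""
    autosomal = mean_autosomal_depth(depth_by_chrom)
    if autosomal <= 0:
        raise ValueError("autosomal depth must be positive")
    return depth_by_chrom[y_key] / autosomal


# Toy depths for illustration only (not real data).
depths = {f"chr{n}": 0.25 for n in range(1, 23)}
depths["chrY_nonhomologous"] = 0.025
ff = fetal_fraction_male(depths)
print(round(ff, 3))  # 0.1
```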
According to an embodiment of the present disclosure, the pregnancy status includes a delivery interval of the pregnant woman. The method according to the embodiment of the present disclosure can be used to predict the probability of premature delivery, intrauterine growth retardation of a fetus at the gestational age in week at delivery, and other pregnancy complications associated with the concentration of fetal cell-free nucleic acids.
According to an embodiment of the present disclosure, the gestational age in week at which the sampling is conducted is 13 to 25 weeks. The inventors found that the correlation between fetal cfDNA concentration and premature delivery was weak when the blood sampling was conducted at a gestational age of 12 weeks or less or between 26 weeks and 30 weeks, and strong when the blood sampling was conducted at a gestational age of 13 to 25 weeks. In general, predictions of the pregnancy status of pregnant women based on the concentration of fetal cell-free nucleic acids suffer from weak correlation. According to the method of the embodiment of the present disclosure, the gestational age in week at which the sampling is conducted is added as one of the parameters for constructing the prediction model, which improves the accuracy of prediction, and blood sampling of the pregnant woman only needs to be conducted once within the gestational age of 13 to 25 weeks, which reduces the cost and risk of multiple blood samplings.
According to an embodiment of the present disclosure, the predetermined prediction model is at least one of a linear regression model, a logistic regression model, or a random forest. According to an embodiment of the present disclosure, the prediction model may be theoretically any statistical model that generalizes different difference distributions.
According to a specific embodiment of the present disclosure, the method of the present disclosure constructs a prediction model based on the known pregnancy status, concentration of fetal cell-free nucleic acids, height, body weight, age, BMI, and gestational age in week (13 to 25 weeks) at which blood sampling is conducted, and determines the magnitude of each fixed coefficient in the prediction model formula, so as to predict the pregnancy status of the pregnant woman to be tested. At the gestational age of 13 to 25 weeks, the peripheral blood of the pregnant woman to be tested is collected to detect the concentration of fetal cell-free nucleic acids, and the concentration of fetal cell-free nucleic acids, height, body weight, age, BMI, and gestational age in week of the pregnant woman are input to the prediction model, so as to obtain prediction information on the pregnancy status of the pregnant woman to be tested.
According to a specific embodiment of the present disclosure, the predetermined parameters further include a height, a body weight, and an age of the pregnant woman, and the prediction model is adapted to calculate a delivery interval of the pregnant woman based on the following formula:
wherein l is a parameter determined based on the probability of premature delivery of the pregnant woman;
and ε are each independently a predetermined coefficient; x^cff is the concentration of fetal cell-free nucleic acids of the pregnant woman; x^sample is the gestational age in week at which the sampling for the peripheral blood of the pregnant woman is conducted; x^height is the height of the pregnant woman; x^weight is the body weight of the pregnant woman; x^age is the age of the pregnant woman; and ε is a sequencing error of a peripheral blood sample of the pregnant woman. According to the method of an embodiment of the present disclosure, the coefficients
may be freely selected as needed; for example, a coefficient for the BMI of the pregnant woman may additionally be included.
According to an embodiment of the present disclosure, l is determined based on the following formula:
wherein b is the base of the logarithm, generally the constant e, and p is the probability of premature delivery of the pregnant woman.
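To illustrate how a fitted model of this kind could be applied to one pregnant woman, the Python sketch below computes the linear predictor l and converts it to a probability by inverting the log-odds relation with base e, that is, p = 1 / (1 + exp(-l)). The coefficient values are invented purely so that the example runs, the dictionary keys are hypothetical names, and the sequencing error term is omitted for simplicity.

```python
import math


def log_odds(coefs, features):
    """Linear predictor l = beta_0 + sum_k beta_k * x_k (error term omitted)."""
    return coefs["intercept"] + sum(coefs[name] * value
                                    for name, value in features.items())


def probability_from_log_odds(l):
    """Invert l = ln(p / (1 - p)) to obtain p = 1 / (1 + exp(-l))."""
    return 1.0 / (1.0 + math.exp(-l))


# Hypothetical coefficients; NOT values from the disclosure.
coefs = {"intercept": -2.0, "cff": -5.0, "sample": 0.02,
         "height": 0.0, "weight": 0.01, "age": 0.01}
# One illustrative subject: fetal fraction, sampling week, cm, kg, years.
features = {"cff": 0.10, "sample": 16, "height": 160, "weight": 55, "age": 30}

l = log_odds(coefs, features)
p = probability_from_log_odds(l)
print(round(l, 2), round(p, 3))
```

Applying the log-odds transformation to the resulting p returns the same l, which is a convenient consistency check when implementing both directions.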
In a fourth aspect of the present disclosure, the present disclosure provides an apparatus for determining a pregnancy status of a pregnant woman, and according to an embodiment of the present disclosure, with reference to
According to an embodiment of the present disclosure, the pregnancy status includes a delivery interval of the pregnant woman. The apparatus according to the embodiment of the present disclosure can be used to predict the probability of premature delivery, intrauterine growth retardation of a fetus at the gestational age in week at delivery, and other pregnancy complications associated with the concentration of fetal cell-free nucleic acids.
According to an embodiment of the present disclosure, the gestational age in week at which the sampling is conducted is 13 to 25 weeks. The inventors found that the correlation between fetal cfDNA concentration and premature delivery was weak when the blood sampling was conducted at a gestational age of 12 weeks or less or between 26 weeks and 30 weeks, and strong when the blood sampling was conducted at a gestational age of 13 to 25 weeks. In general, predictions of the pregnancy status of pregnant women based on the concentration of fetal cell-free nucleic acids suffer from weak correlation. According to the apparatus of the embodiment of the present disclosure, the gestational age in week at which the sampling is conducted is added as one of the parameters for constructing the prediction model, which improves the accuracy of prediction, and blood sampling of the pregnant woman only needs to be conducted once within the gestational age of 13 to 25 weeks, which reduces the cost and risk of multiple blood samplings.
According to an embodiment of the present disclosure, the predetermined prediction model is at least one of a linear regression model, a logistic regression model, or a random forest. According to the apparatus of an embodiment of the present disclosure, the prediction model may be theoretically any statistical model that generalizes different difference distributions.
According to a specific embodiment of the present disclosure, the predetermined parameters further include a height, a body weight, and an age of the pregnant woman, and the prediction model is adapted to calculate a delivery interval of the pregnant woman based on the following formula:
wherein l is a parameter determined based on the probability of premature delivery of the pregnant woman;
each independently a predetermined coefficient; x^cff is the concentration of fetal cell-free nucleic acids of the pregnant woman; x^sample is the gestational age in week at which the sampling for the peripheral blood of the pregnant woman is conducted; x^height is the height of the pregnant woman; x^weight is the body weight of the pregnant woman; x^age is the age of the pregnant woman; and ε is a sequencing error of a peripheral blood sample of the pregnant woman. According to an embodiment of the present disclosure, the coefficients β_0, β_cff,
may be freely selected as needed; for example, a coefficient for the BMI of the pregnant woman may additionally be included.
According to an embodiment of the present disclosure, l is determined based on the following formula:
where b is the base of the logarithm, generally the constant e, and p is the probability of premature delivery of the pregnant woman.
In a fifth aspect of the present disclosure, provided is a computer-readable storage medium having a computer program stored thereon. The program, when executed by a processor, implements the steps of the above-described method for constructing the prediction model. Thus, the above-described method for constructing the prediction model can be effectively implemented, so that the prediction model can be effectively constructed, and the prediction model can be then used to perform prediction on an unknown sample to determine the pregnancy status of the pregnant woman to be detected.
In a sixth aspect of the present disclosure, provided is an electronic device including: the computer-readable storage medium; and one or more processors configured to execute the program in the computer-readable storage medium.
The present disclosure will be further explained below with reference to specific examples. The experimental methods applied in the following examples are conventional methods, unless otherwise specified. The materials, reagents, etc. used in the following examples are all commercially available, unless otherwise specified.
The technical solutions of the present disclosure will be explained below with reference to examples. Those skilled in the art will understand that these examples are illustrative only, and should not be considered as limiting the scope of the present disclosure. Examples, where specific techniques or conditions are not specified, are implemented in accordance with techniques or conditions described in the literature in the art (for example, refer to J. Sambrook et al. “Molecular Cloning: A Laboratory Manual” translated by Huang Peitang et al., 3rd edition, Science Press) or according to the product specification. All of the used reagents or instruments which are not specified with the manufacturer are conventional commercially-available products, for example, purchased from Illumina.
38964 samples were classified according to different gestational ages in week at which blood sampling was conducted, and the correlation between the concentration of fetal cfDNAs in plasma and the premature delivery was calculated respectively. With reference to
Plasma cfDNA data of 38964 pregnant women in combination with the gestational age in week at which the blood sampling was conducted and the age, height, and body weight information of the pregnant woman served as a training set:
(1) A linear regression model was established with the gestational age in week at delivery as a continuous variable in the prediction of the gestational age in week at delivery.
Specifically, by taking the gestational age in week at delivery as Y value, and taking the fetal cfDNA concentration, the gestational age in week at which the blood sampling was conducted, and the height, body weight, age and BMI of pregnant women as covariates, a prediction model was established:
gestational age in week at delivery corresponding to sample i, x_i^cff is the fetal cfDNA concentration corresponding to sample i, x_i^sample is the gestational age in week at which the blood sampling is conducted, corresponding to sample i, x_i^height is the height of the pregnant woman corresponding to sample i, x_i^weight is the body weight of the pregnant woman corresponding to sample i, x_i^age is the age of the pregnant woman corresponding to sample i, x_i^bmi is the BMI of the pregnant woman corresponding to sample i, and p is the total number of samples in the training set, where p = 38964.
The estimated values of coefficient β for different variables in the finally obtained prediction model are shown in the column of gestational age in week at delivery in Table 2.
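As a rough illustration of how coefficient estimates of this kind can be obtained, the Python sketch below fits an ordinary least-squares model by solving the normal equations. The source does not state which solver or software was used, so the routine, the function name `fit_linear_regression`, and the toy data (two covariates only, generated from known coefficients) are all assumptions made for the example.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    X = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[a] * x[b] for x in X) for b in range(k)]
         + [sum(x[a] * y for x, y in zip(X, targets))]
         for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * w for v, w in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Toy data: (fetal cfDNA fraction, sampling week); targets are generated
# from made-up coefficients 39.0, -10.0 and 0.05 so the fit can be checked.
rows = [(0.05, 13), (0.10, 16), (0.15, 20), (0.08, 25), (0.12, 18)]
targets = [39.0 - 10.0 * cff + 0.05 * week for cff, week in rows]
beta = fit_linear_regression(rows, targets)
print([round(b, 4) for b in beta])
```

With noise-free toy targets the routine recovers the generating coefficients, which makes it easy to verify before it is pointed at real covariates such as height, body weight, age, and BMI.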
(2) A logistic regression model was established by defining premature delivery events as Y = 0 and defining full-term delivery events as Y = 1 in the prediction of premature delivery.
Specifically, the probability of premature delivery of a sample was set as p = P(Y = 0), so that the probability of full-term delivery of the sample was 1 − p = P(Y = 1), and the probability p was subjected to the log-odds transformation, i.e.,
where b is the base of the logarithm and is generally the constant e.
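The formula is not reproduced in this text. The log-odds transformation it names is the standard logit function; for a probability p and a logarithm base b it can be written as follows, together with its inverse:

```latex
l = \log_b\!\left(\frac{p}{1 - p}\right),
\qquad
p = \frac{b^{\,l}}{1 + b^{\,l}}
```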
The transformed l was put into the linear regression model, and the fetal cfDNA concentration, gestational age in week at which blood sampling was conducted, and height, body weight, and age of pregnant women were also taken as covariates to establish a prediction model.
Specifically, by taking the transformed l as the Y value, and taking the fetal cfDNA concentration, the gestational age in week at which blood sampling was conducted, and the height, body weight, age, and BMI of the pregnant women as covariates, a prediction model was established:
is the log-odds transformation result corresponding to sample i, x_i^cff is the fetal cfDNA concentration corresponding to sample i, x_i^sample is the gestational age in week at which blood sampling was conducted, corresponding to sample i, x_i^height is the height of the pregnant woman corresponding to sample i, x_i^weight is the body weight of the pregnant woman corresponding to sample i, x_i^age is the age of the pregnant woman corresponding to sample i, x_i^bmi is the BMI of the pregnant woman corresponding to sample i, and p is the total number of samples in the training set, where p = 38964.
The estimated values of coefficient β for various variables in the finally obtained prediction model are shown in the column of premature delivery in Table 1.
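One common way to estimate the coefficients of a logistic regression model is maximum likelihood, sketched below in Python with plain batch gradient ascent. The source does not describe its fitting procedure, so this technique, the function names, the learning rate, and the toy data are assumptions. Labels are coded 1 for the event whose probability is being modelled, and the single covariate stands in for a centred predictor.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_regression(rows, labels, lr=0.1, epochs=5000):
    """Fit intercept and weights by batch gradient ascent on the log-likelihood."""
    n, k = len(rows), len(rows[0])
    beta = [0.0] * (k + 1)  # beta[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for x, y in zip(rows, labels):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
            err = y - sigmoid(z)
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Toy, non-separable data: one centred covariate per sample.
rows = [(-2.0,), (-1.0,), (-1.0,), (0.0,), (0.0,), (1.0,), (1.0,), (2.0,)]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
beta = fit_logistic_regression(rows, labels)
# Predicted probability of the modelled event at a low covariate value.
p_low = sigmoid(beta[0] + beta[1] * -2.0)
print([round(b, 3) for b in beta], round(p_low, 3))
```

In this toy set the event becomes less likely as the covariate increases, so the fitted slope is negative. With real data, each row would hold the covariates listed above.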
After the prediction models for premature delivery and for gestational age in week at delivery were obtained, an additional 32049 samples were used as a test set. For each sample, the fetal cfDNA concentration, the gestational age in week at which blood sampling was conducted, and the age, height, body weight, and BMI of the pregnant woman were put into the linear regression model to predict the gestational age in week at delivery and into the logistic regression model to predict premature delivery.
Refer to
In addition, reference to the term “an embodiment”, “some embodiments”, “an example”, “a specific example” or “some examples” or the like means that a specific feature, structure, material, or characteristic described in combination with the embodiment(s) or example(s) is included in at least one embodiment or example of the present disclosure. In this specification, illustrative expressions of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, without mutual contradiction, those skilled in the art may combine different embodiments or examples and features of the different embodiments or examples described in this specification.
Although the embodiments or examples of the present disclosure have been illustrated and described above, it should be understood that the embodiments or examples are illustrative and should not be construed as limiting the present disclosure, and persons of ordinary skill in the art may make various changes, modifications, replacements and variations to the above embodiments or examples within the scope of the present disclosure.
This application is a continuation of International Application No. PCT/CN2020/094394, filed on Jun. 4, 2020, the entire disclosure of which is incorporated herein by reference.
| | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/CN2020/094394 | Jun 2020 | WO |
| Child | 18061264 | | US |