This application claims priority under 35 U.S.C. § 119 to Korean Patent Application Nos. 10-2021-0030288, filed on Mar. 8, 2021, and 10-2021-0081013, filed on Jun. 22, 2021, the disclosures of which are incorporated herein by reference in their entirety.
The present invention relates to a device and method of predicting disease by using elderly cohort data, and more particularly, to a device and method of predicting disease by using elderly cohort data, and an elderly disease prediction model applied thereto, which may predict the possibility of an outbreak of an elderly disease, including cerebral stroke, by using cohort data of persons aged 60 or older.
Based on the statistics of cause of death provided by the National Statistical Office in 2018, it has been reported that the total number of dead persons is 298,820, the number of dead men is 161,187, and the number of dead women is 137,633. By cause of death, the statistics report 79,153 cases of malignant neoplasm (cancer), 32,004 cases of heart disease, 23,280 cases of pneumonia, and 22,940 cases of cerebrovascular disease. Here, in the patient group aged 60 or older, the death rate caused by heart disease and cerebrovascular disease, both of which are circulatory diseases, is progressively increasing.
An elderly disease including the heart disease and the cerebrovascular disease has various symptoms and is variously classified, and for this reason, it is difficult to reliably evaluate a disorder caused by a corresponding symptom and the neurological damage accompanying it. Also, patients having a past outbreak history have a high possibility of recurrence, and thus, there is an urgent need to develop technology which helps to continuously trace and observe target persons so that a patient can be diagnosed and cured at an appropriate time.
For example, cerebral stroke is one of the main diseases which cause functional disorders in adults and elderly persons and, depending on the degree of disorder, is one of the fatal diseases which cause difficulty in social or economic activities. Cerebral stroke may occur in various forms depending on the degree of disorder of a patient or an accompanying disease, and thus, a current disorder level should be accurately evaluated and risk factors should be continuously managed for each person.
The National Institutes of Health Stroke Scale (NIHSS) of the National Institutes of Health, which is used for quantitative measurement of a disorder after the outbreak of cerebral stroke, is widely used globally as an indicator whose test-retest reliability and validity have been verified. The NIHSS is widely used to evaluate the overall disorder of each cerebral stroke patient, but has a drawback in that it is unable to provide accurate prediction information for evaluating an initial disorder.
Accordingly, the present invention provides a device and method of predicting disease by using elderly cohort data, and an elderly disease prediction model applied thereto, which analyze cohort data of an elderly group, defined as persons aged 60 or older, by using a prediction model based on a convolutional neural network (CNN) to predict the outbreak of an elderly disease, thereby providing objective diagnosis and a cure for elderly diseases.
The objects of the present invention are not limited to the aforesaid, but other objects not described herein will be clearly understood by those skilled in the art from descriptions below.
In one general aspect, a method of predicting disease by using elderly cohort data includes: collecting cohort data of an elderly group; preprocessing the collected cohort data; extracting an attribute in the collected cohort data and selecting a subset corresponding to the extracted attribute; and analyzing a degree of risk of a disease on the basis of the selected attribute set by using a disease prediction model.
In another general aspect, a device for predicting disease by using elderly cohort data includes: a data collector configured to collect cohort data of an elderly group; a data preprocessor configured to preprocess the collected cohort data; a subset selector configured to extract an attribute in the collected cohort data and select a subset corresponding to the extracted attribute; and a disease analyzer configured to analyze a degree of risk of a disease on the basis of the selected attribute set by using a disease prediction model.
A computer program according to another embodiment of the present invention for solving the above-described problem may be coupled to a computer which is hardware, may execute a method of predicting disease by using elderly cohort data, and may be stored in a computer-readable recording medium.
Other features and aspects will be apparent from the following detailed description, the drawings, and the claims.
Embodiments of the present invention are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the present invention to one of ordinary skill in the art. Since the present invention may have diverse modified embodiments, preferred embodiments are illustrated in the drawings and are described in the detailed description of the present invention. However, this does not limit the present invention to specific embodiments, and it should be understood that the present invention covers all the modifications, equivalents, and replacements within the idea and technical scope of the present invention. In describing the present invention, when it is determined that a detailed description of known techniques associated with the present invention may unnecessarily obscure the gist of the present invention, the detailed description thereof will be omitted.
Moreover, each of terms such as “ . . . part”, “ . . . unit”, and “module” described in the specification denotes an element for performing at least one function or operation, and may be implemented in hardware, software, or a combination of hardware and software.
In the following description, the technical terms are used only to explain a specific exemplary embodiment and do not limit the present invention. The terms of a singular form may include plural forms unless stated to the contrary. The meaning of ‘comprise’, ‘include’, or ‘have’ specifies a property, a region, a fixed number, a step, a process, an element and/or a component but does not exclude other properties, regions, fixed numbers, steps, processes, elements and/or components.
Referring to
The data collector 110 may collect cohort data of an elderly group.
Here, the cohort data of the elderly group may be collected in a research database built to support research on elderly persons, such as prognosis analysis and risk factors of elderly diseases, and for example, a cohort database provided by an institution such as the National Health Insurance Service may correspond thereto. Also, the cohort data of the elderly group may include social and economic information, disorder and death information, medical use information including medical treatment and medical checkup, medical institution status information, and long-term elderly care service application and use information.
In an embodiment, the data collector 110 may periodically update the cohort data stored in the database, and thus, may allow the disease prediction model to learn the updated cohort data in advance. In detail, the disease prediction device 100 according to the present invention may calculate an outbreak rate (a risk degree) of a disease to be analyzed by using the disease prediction model that receives the cohort data, and thus, the database storing the cohort data may need to be periodically updated.
The data preprocessor 120 may preprocess the collected cohort data.
In detail, in order to perform classification or prediction based on machine learning and deep learning, it may be needed to perform a preprocessing operation on raw data, which is highly likely to include repeated, incomplete, and inconsistent data. In the present invention, a preprocessing operation may be performed on the cohort data so as to improve the performance and accuracy of the disease prediction model.
In an embodiment, the data preprocessor 120 may remove a repeated tuple and a noise tuple in each data table included in the cohort data and may convert and normalize a data format so as to enable analysis through the disease prediction model. Here, the tuple may denote a record or a row in the data table.
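The removal of repeated tuples and noise tuples described above can be sketched as follows. This is a minimal illustration, not the implementation of the data preprocessor 120: the field names (`person_id`, `bmi`) are hypothetical, and treating a row with a missing required value as a noise tuple is an assumption, since the criterion for noise is not stated here.

```python
from typing import Dict, List


def remove_repeated_and_noise(rows: List[Dict[str, object]],
                              required: List[str]) -> List[Dict[str, object]]:
    """Drop exact duplicate rows and rows that lack a required value."""
    seen = set()
    cleaned = []
    for row in rows:
        # Field names are unique within a row, so sorting by item is safe.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # repeated tuple
        seen.add(key)
        if any(row.get(name) is None for name in required):
            continue  # assumed definition of a noise tuple
        cleaned.append(row)
    return cleaned


rows = [
    {"person_id": 1, "bmi": 23.1},
    {"person_id": 1, "bmi": 23.1},   # exact repeat of the first row
    {"person_id": 2, "bmi": None},   # missing measurement
    {"person_id": 3, "bmi": 27.4},
]
cleaned = remove_repeated_and_noise(rows, required=["bmi"])
```

In this sketch, only the rows for persons 1 and 3 remain after cleaning.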
Moreover, the data preprocessor 120 may generate a main data table associated with a disease which is to be predicted and may construct a data mart including a data table associated with a main disease code of the disease which is to be predicted, on the basis of joining of the generated main data tables.
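One way to picture the joining of data tables into a data mart is the sketch below. The table layouts, the field names (`person_id`, `main_code`), and the example code `I63` (the ICD-10 code for cerebral infarction) are illustrative assumptions and are not taken from the cohort database.

```python
from typing import Dict, List, Set


def build_data_mart(checkups: List[Dict[str, object]],
                    treatments: List[Dict[str, object]],
                    disease_codes: Set[str]) -> List[Dict[str, object]]:
    """Inner-join checkup rows to treatment rows on person_id,
    keeping only treatments whose main disease code is targeted."""
    by_person: Dict[object, List[Dict[str, object]]] = {}
    for t in treatments:
        if t["main_code"] in disease_codes:
            by_person.setdefault(t["person_id"], []).append(t)
    mart = []
    for c in checkups:
        for t in by_person.get(c["person_id"], []):
            mart.append({**c, **t})  # merged row for the data mart
    return mart


checkups = [{"person_id": 1, "bmi": 23.1}, {"person_id": 2, "bmi": 30.0}]
treatments = [{"person_id": 1, "main_code": "I63"},
              {"person_id": 2, "main_code": "J18"}]
mart = build_data_mart(checkups, treatments, {"I63"})
```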
When a prediction target disease according to the present invention is cerebral stroke, as in
The subset selector 130 may extract an attribute in the collected cohort data and may select a subset corresponding to the extracted attribute. For example, when a prediction target disease according to the present invention is cerebral stroke, the subset selector 130 may extract a total of 64 attributes in the cohort data. Here, the extracted attributes may include continuous attributes, including a body mass index, proteinuria, total cholesterol level, serum creatinine level, and gamma-GTP level, and discrete attributes, including daily drinking amount, smoking, the presence of hepatitis B antigen (HBeAg), and high-intensity physical activity.
In an embodiment, the subset selector 130 may perform Z-score normalization based on the following Equation 1 on the attribute extracted from the collected cohort data.
Here, x may denote each attribute, σ may denote a standard deviation of x, μ may denote an average of x, and α may denote a weight value.
Such a normalization process may convert data so that the data falls within a small range of 0.0 to 1.0, and thus, each attribute may have the same weight value. Therefore, for an attribute whose values span a wide range, such as serum creatinine level, a case where the value depends on a measurement unit may be prevented.
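As a rough illustration of Z-score normalization, the sketch below applies the textbook form (x − μ)/σ to one attribute column. It is not a reproduction of Equation 1: the weight value α is left out because its position in the equation is not shown here, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List


def z_score(values: List[float]) -> List[float]:
    """Standard Z-score of each value in one attribute column."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical measurements for a single attribute.
z = z_score([1.0, 2.0, 3.0])
```

Because each column is rescaled by its own mean and standard deviation, the result no longer depends on the unit in which the attribute was measured.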
Moreover, in performing data classification, the subset selector 130 may calculate and select a subset that yields a probability distribution similar to the probability distribution calculated when all attributes extracted from the cohort data are used. Here, in order to calculate and select the subset, the subset selector 130 may use Hall's theorem. In detail, by using Hall's theorem, an entropy corresponding to Y, including a best-first search value and an attribute value, and a conditional probability based on Pearson's correlation coefficient between the attributes and a target class may be calculated. Also, in order to obtain an information gain of each attribute, the entropy corresponding to an arbitrary attribute Y may be calculated as the following Equation 2.
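The entropy and information gain mentioned above can be illustrated with the standard Shannon definitions. This is a sketch under the assumption that Equation 2 follows the usual form H(Y) = −Σ p(y) log2 p(y); the function names and sample values are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Dict, List, Sequence


def entropy(labels: Sequence[object]) -> float:
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(attribute: Sequence[object],
                     target: Sequence[object]) -> float:
    """Reduction in the entropy of target once attribute is known."""
    n = len(target)
    groups: Dict[object, List[object]] = {}
    for a, y in zip(attribute, target):
        groups.setdefault(a, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - conditional


target = [0, 0, 1, 1]          # e.g. normal vs. cerebral stroke
informative = [0, 0, 1, 1]     # attribute that mirrors the target
uninformative = [0, 1, 0, 1]   # attribute unrelated to the target
gain_hi = information_gain(informative, target)
gain_lo = information_gain(uninformative, target)
```

An attribute that fully determines the target class recovers the whole entropy of the target, whereas an unrelated attribute yields no gain.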
Moreover, the subset selector 130 may evaluate the subset for which the following Equation 3 yields the largest value as the subset that best represents all attributes, and the disease prediction model may perform analysis by using that subset. The following Equation 3 may represent a merit function for evaluating the degree to which each subset (Fs⊂F) efficiently represents all attributes.
Here, Fs may denote a subset, k may denote the number of attributes of Fs,
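For orientation, the sketch below computes the merit used in Hall's correlation-based feature selection, k·r_cf / sqrt(k + k(k − 1)·r_ff), where r_cf is the mean attribute-to-class correlation and r_ff is the mean attribute-to-attribute correlation. That Equation 3 has exactly this form is an assumption, since the remaining symbol definitions are not available here; the helper names and sample data are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def cfs_merit(features: List[Sequence[float]],
              target: Sequence[float]) -> float:
    """Merit of a subset: high class correlation, low redundancy."""
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = (sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


target = [1.0, 2.0, 3.0, 4.0]
relevant = [1.0, 2.0, 3.0, 4.0]      # perfectly correlated with the target
irrelevant = [1.0, -1.0, -1.0, 1.0]  # uncorrelated with the target
merit_one = cfs_merit([relevant], target)
merit_two = cfs_merit([relevant, irrelevant], target)
```

Adding the uncorrelated attribute lowers the merit, which is why a search over subsets would prefer the smaller subset in this toy case.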
The disease analyzer 140 may analyze the degree of risk of a disease by applying the disease prediction model to the selected attribute set. Here, the disease prediction model may be constructed based on a one-dimensional (1D) CNN. Hereinafter, a detailed structure of the disease prediction model and an analysis result based thereon will be described.
Referring to
Moreover, referring to
Moreover, a softmax layer which evaluates a probability value associated with target disease prediction may be disposed at a final position of the hidden layer. For example, when the prediction target disease is cerebral stroke, the softmax layer may distinguish elderly persons having cerebral stroke from normal elderly persons and may classify each elderly person according to the class having the larger evaluated probability value.
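The role of the softmax layer can be illustrated with a small sketch that converts two raw scores into class probabilities. The class order (normal, cerebral stroke) and the example scores are hypothetical.

```python
from math import exp
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the classes (normal, cerebral stroke).
probs = softmax([0.5, 2.0])
predicted_class = probs.index(max(probs))  # index of the larger probability
```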
Moreover, a rectified linear unit (ReLU) activation function may be used between each convolution layer and pooling layer of the disease prediction model, and a batch normalization process may be applied. Here, the ReLU activation function may be a function that returns 0 for a value less than 0 and returns a value greater than 0 as-is, and adding the batch normalization process may prevent the vanishing gradient which occurs when parameters are determined.
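The building blocks named above (1D convolution, batch normalization, ReLU, and pooling) can be sketched in plain Python as follows. This is a single-channel toy forward pass, not the architecture of the disease prediction model: the kernel values, pooling size, and ordering of batch normalization before ReLU are assumptions, and the learned scale and shift of batch normalization are omitted.

```python
from math import sqrt
from typing import List


def conv1d(signal: List[float], kernel: List[float]) -> List[float]:
    """Valid 1D convolution (cross-correlation, as in most frameworks)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def batch_norm(batch: List[List[float]], eps: float = 1e-5) -> List[List[float]]:
    """Normalize one channel over every value in the batch."""
    flat = [v for fmap in batch for v in fmap]
    mu = sum(flat) / len(flat)
    var = sum((v - mu) ** 2 for v in flat) / len(flat)
    return [[(v - mu) / sqrt(var + eps) for v in fmap] for fmap in batch]


def relu(values: List[float]) -> List[float]:
    """Return 0 for negative inputs and the input itself otherwise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values: List[float], size: int = 2) -> List[float]:
    """Non-overlapping max pooling."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


# Two hypothetical samples, each a vector of normalized attributes.
batch = [[0.1, 0.4, 0.3, 0.9, 0.2], [0.8, 0.2, 0.5, 0.1, 0.7]]
kernel = [1.0, 0.0, -1.0]  # illustrative filter, not a learned one
feature_maps = batch_norm([conv1d(x, kernel) for x in batch])
pooled = [max_pool1d(relu(f)) for f in feature_maps]
```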
Cohort data including data of 38,669 elderly persons having cerebral stroke and data of 38,669 randomly extracted normal elderly persons may be used for verifying the performance of the disease prediction device 100 according to the present invention, and an experiment has been performed based on a data set of 77,338 persons in total. In the two kinds of experiments performed, 10-fold cross-validation has been applied, Adam has been used as the optimizer, and hyperparameters such as a learning rate and the number of training iterations have been tuned by varying them as shown in a table of
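The 10-fold cross-validation mentioned above can be sketched as an index splitter. This is a generic illustration: the shuffling seed is arbitrary, class stratification is not performed, and how the folds were actually formed in the experiment is not stated here.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Small stand-in for the 77,338-person data set.
splits = list(k_fold_indices(100, k=10))
```

Each of the ten rounds trains on nine folds and evaluates on the remaining one, so every person contributes to the evaluation exactly once.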
Referring to
Referring to
Referring to
Subsequently, in step S702, the cohort data may be preprocessed into a format capable of being applied to a disease prediction model. Here, the data preprocessor 120 may remove a repeated tuple and a noise tuple in each data table included in the cohort data and may convert and normalize a data format so as to enable analysis through the disease prediction model.
Subsequently, in step S703, the process may extract an attribute in the collected cohort data and may select a subset corresponding to the extracted attribute. Here, the extracted attributes may include continuous attributes, including a body mass index, proteinuria, total cholesterol level, serum creatinine level, and gamma-GTP level, and discrete attributes, including daily drinking amount, smoking, the presence of hepatitis B antigen (HBeAg), and high-intensity physical activity.
Subsequently, in step S704, the process may analyze the degree of risk of a target disease by using the selected subset. The degree of risk of the target disease may be determined based on a disease outbreak rate calculation result of the disease prediction model, and the disease prediction model may be constructed based on the 1D CNN.
In the above description, according to an implementation embodiment of the present invention, steps S701 to S704 may be further divided into additional steps or may be combined into fewer steps. Also, some steps may be omitted as needed, and the sequence of the steps may be changed. Furthermore, despite the other omitted content, the descriptions of
An embodiment of the present invention described above may be implemented as a program (or an application) and may be stored in a medium, so as to be executed in connection with a server which is hardware.
The above-described program may include code written in a computer language such as C, C++, Java, or machine language readable by a processor (CPU) of a computer through a device interface of the computer, so that the computer reads the program and executes the methods implemented as the program. Such code may include functional code defining the functions needed for executing the methods, and moreover, may include control code related to an execution procedure needed for the processor of the computer to execute the functions on the basis of a predetermined procedure. Also, the code may further include memory reference-related code indicating the location (address) in an internal or external memory of the computer at which additional information or media needed for the processor of the computer to execute the functions should be referenced. Also, when the processor needs to communicate with a remote computer or server so as to execute the functions, the code may further include communication-related code specifying how to communicate with the remote computer or server by using a communication module of the computer and which information or media is to be transmitted or received in performing communication.
The storage medium may denote a device-readable medium semi-permanently storing data, instead of a medium storing data for a short moment like a register, a cache, and a memory. In detail, examples of the storage medium may include read only memory (ROM), random access memory (RAM), CD-ROM, a magnetic tape, a floppy disk, and an optical data storage device, but are not limited thereto. That is, the program may be stored in various recording media of various servers accessible by the computer or in various recording media of the computer of a user. Also, the medium may be distributed to computer systems connected to one another over a network and may store computer-readable code in a distributed manner.
The foregoing description of the present invention is for illustrative purposes, and those with ordinary skill in the technical field to which the present invention pertains will understand that the present invention may be easily modified into other specific forms without changing the technical idea or essential features of the present invention. Therefore, it should be understood that the embodiments described above are exemplary in all respects and are not limiting. For example, each component described as monolithic may be carried out in a distributed manner, and likewise, components described as distributed may be carried out in a combined form.
The present invention may predict the outbreak of a disease on the basis of cohort data of an elderly group, and thus, may analyze the degree of risk of a target disease on the basis of all main risk factors.
The present invention may provide a risk analysis result of an elderly disease, thereby enabling medical facilities to easily provide objective diagnosis and a cure for a target disease.
The present invention may construct and apply a disease prediction model optimized for diseases of elderly persons of Korea to provide a high-accuracy analysis result of a target disease.
A number of exemplary embodiments have been described above.
Nevertheless, it will be understood that various modifications may be made. For example, suitable results may be achieved if the described techniques are performed in a different order and/or if components in a described system, architecture, device, or circuit are combined in a different manner and/or replaced or supplemented by other components or their equivalents. Accordingly, other implementations are within the scope of the following claims.
Number | Date | Country | Kind |
---|---|---|---|
10-2021-0030228 | Mar 2021 | KR | national |
10-2021-0081013 | Jun 2021 | KR | national |