This application claims priority to Chinese Patent Application Serial No. 202210532391.6, filed on May 11, 2022, the entire disclosure of which is incorporated herein by reference.
The present disclosure relates to processing massive data, and more specifically, relates to a method and a device for data processing based on a data value, and a storage medium.
Various industries are undergoing a digital transformation, and digital platforms use data to make intelligent service decisions. However, big data brings problems such as a high calculation cost and the lack of an effective lever for data management.
Because effective data management is lacking, the full data are usually used, which is both constrained by computing resources and affected by noise data.
For example, an online medical platform desires to establish a consultation demand prediction model for each doctor, so as to optimize its resource configuration. If the full data are used to train the model, the calculation cost is relatively high. In addition, the online medical platform is affected by noise data such as click farming. That is, some doctors encourage patients to post false evaluations or to click articles and gifts on the platform in order to obtain a higher return.
However, in the related art it is difficult to evaluate the value of a data point, and the data cannot be managed in a targeted manner, so the data are used inefficiently.
According to a first aspect of the present disclosure, a computer-implemented method for data processing based on a data value is provided. The method includes: calculating a value of each piece of original data based on a utility function associated with a service revenue; acquiring pieces of high-value data from original data based on the value of each piece of original data; and performing a service prediction on the acquired high-value data.
According to a second aspect of the present disclosure, a computing device is provided, which includes a processor, and a memory stored with instructions executable by the processor. The processor is configured to: calculate a value of each piece of original data based on a utility function associated with a service revenue; acquire pieces of high-value data from original data based on the value of each piece of original data; and perform a service prediction based on the acquired high-value data.
According to a third aspect of the present disclosure, a non-transitory computer readable storage medium having computer programs stored thereon is provided. When the computer programs are executed by a computing device, a method for data processing based on a data value is implemented. The method includes: calculating a value of each piece of original data based on a utility function associated with a service revenue; acquiring pieces of high-value data from original data based on the value of each piece of original data; and performing a service prediction on the acquired high-value data.
The present disclosure may be more readily understood based on the following description of the attached drawings, in which the same numerals indicate the same structural units.
According to the following embodiments of the present disclosure, an intelligent service decision may be made based on historical data in a service system, since high-value data are screened out from a large amount of historical data, and a service prediction is performed on the basis of the screened high-value data. In this way, an operational burden of the service system may be reduced, and the accuracy of service prediction may be improved.
According to an exemplary embodiment of the present disclosure, a service revenue is used as a basis for measuring the data value, so that a service improvement is effectively associated with the data value. This makes it possible to screen out more effectively the data that help improve the decision-making performance of the service platform, and facilitates interpretation of the decision-making mechanism.
As an example,
As illustrated in
The original data may be data that is generated online, data that is generated and stored in advance, or data that is received from outside via an input device or a transmission medium. The data may relate to attribute information of an individual, an enterprise or an organization, such as an identity, an education, an occupation, an asset, a contact, a liability, an income, a revenue and a tax payment. Alternatively, the data may also relate to other attribute information of service-related items, for example, information about a transaction amount of a sales contract, both parties of the sales transaction, the object of transaction and a transaction place. It should be noted that data contents mentioned in the exemplary embodiment of the present disclosure may relate to a performance or property of any object or transaction in a certain aspect of the service, which is not limited to defining or describing an individual, an object, an organization, a unit, an institution, an item and an event.
The original data may be structured data or unstructured data from different sources, for example, text data or numerical data. The data may be derived from an internal entity that is expected to perform a service prediction, for example, a bank, an enterprise, a school, etc. from which a prediction result is expected to be acquired. The data may also be derived from an external entity such as a data provider, the Internet (for example, a social website), a mobile operator, an APP operator, an express company, a credit institution, etc. Optionally, the internal data and the external data may be used in combination to form original data carrying more information.
The original data may be input into the high-value data acquiring unit 100 by an input device, may be automatically generated by the high-value data acquiring unit 100 based on existing data, or may be acquired by the high-value data acquiring unit 100 from a network (for example, from a storage medium on the network, such as a data warehouse). In addition, an intermediate data exchange device such as a server may help the high-value data acquiring unit 100 acquire corresponding data from an external data source. A data conversion module such as a text analysis module in the high-value data acquiring unit 100 may convert the acquired data into a format that is easy to process. It should be noted that the high-value data acquiring unit 100 may be configured as various modules consisting of software, hardware and/or firmware, some or all of which may be integrated or may work together to complete a particular function.
According to an exemplary embodiment of the present disclosure, the high-value data acquiring unit 100 may calculate a value of each piece of original data based on a utility function associated with a service revenue, and acquire at least a portion of the high-value data based on a calculation result. When the high-value data acquiring unit 100 screens out the high-value data, unification of data quality and service operation is achieved by taking the service revenue as a measurement index, so that the prediction performance may be improved more effectively while the operational burden is reduced.
In addition, the apparatus 10 for data processing as illustrated in
As illustrated in
According to an exemplary embodiment of the present disclosure, the original data may include at least one period of historical data that occurs before a current moment, and correspondingly, the estimation unit 200 performs the service prediction for a next period of data based on the acquired high-value data, to obtain prediction data corresponding to the current moment.
In addition, the apparatus 10 for data processing illustrated in
In the present disclosure, the estimation unit 200 may make an intelligent decision based on the high-quality data. For example, a model (for example, a machine learning model such as an artificial intelligence (AI) model) is trained by using high-quality data, and the service prediction is performed by using the trained model.
The training unit 210 is configured to train a service model by using the acquired high-value data. As an example, the training unit 210 may train a machine learning model by using the acquired high-value data. Machine learning (ML) is an inevitable product of AI development reaching a certain stage, and aims to improve the performance of a system by learning from experience through computation. In a computer system, “experience” usually exists in the form of “data”, and a “model” may be generated from the data by a machine learning algorithm. That is, empirical data are provided to the machine learning algorithm, and a model is generated based on the empirical data. When a new situation is faced, the model may provide a corresponding determination, that is, a prediction result. Machine learning may be implemented as “supervised learning”, “unsupervised learning” or “semi-supervised learning”. It should be noted that the exemplary embodiments of the present disclosure are not limited to a particular machine learning algorithm. In addition, it should also be noted that other means, such as a statistical algorithm, may be combined in the process of training and applying a model.
The prediction unit 220 may perform the service prediction by using a service model trained by the training unit 210. The model may be trained by the training unit 210 offline or online. The training not only includes a first training of the model, but also an update of the model. Accordingly, the prediction unit 220 may perform an offline or online prediction by using the trained service model, which is not limited in the exemplary embodiments of the present disclosure.
The apparatus illustrated in
The method for data processing according to an exemplary embodiment of the present disclosure is described referring to
As an example, the method illustrated in
At step S100, a value of each piece of original data is calculated based on a utility function associated with a service revenue.
As an example, the high-value data acquiring unit 100 may collect the original data in a manual, semi-automatic, or fully automatic mode, or process the collected original data so that the processed data has an appropriate format or form. As an example, the high-value data acquiring unit 100 may collect the original data in batches.
The high-value data acquiring unit 100 may receive original data records manually input by a user by means of an input device (for example, a workstation). In addition, the high-value data acquiring unit 100 may retrieve the original data records from a data source system in a fully automatic mode. For example, a timer mechanism implemented in software, firmware, hardware, or a combination thereof systematically sends requests to a data source, and the requested original data are obtained from the responses. The data source may include one or more databases or other servers. Fully automatic data acquisition may be implemented via an internal network and/or an external network, which may include transmitting encrypted data over the Internet. When a server, a database and a network are configured to communicate with each other, data acquisition may be performed automatically without manual intervention. However, it should be noted that a certain user input operation may still be present in this mode. The semi-automatic mode lies between the manual mode and the fully automatic mode. The difference between the semi-automatic mode and the fully automatic mode is that a trigger mechanism activated by a user replaces the above timer mechanism. In this case, a request for extracting data is generated only when a particular user input is received. In an example, each time the original data are acquired, the captured data may be stored in a non-volatile memory. As an example, a data warehouse may be configured to store the original data collected during acquisition, as well as the processed data.
The acquired original data records may be derived from the same data source or from different data sources, that is, each data record may also be a splicing of different data records. For example, in addition to acquiring the information data record that a client fills in when applying to a bank for a credit card (which includes attribute information fields such as income, education, post and asset conditions), the high-value data acquiring unit 100 may further acquire other data records of the client in the bank, such as loan record data and daily transaction data. The acquired data records may be spliced into a complete data record. In addition, the high-value data acquiring unit 100 may further acquire data derived from other private sources or public sources, for example, data derived from a data provider, data derived from the Internet (e.g., a social website), data derived from a mobile operator, data derived from an APP operator, data derived from an express company and data derived from a credit institution.
Optionally, the high-value data acquiring unit 100 may store and/or process collected data by means of a hardware cluster (such as a Hadoop cluster, a Spark cluster, etc.). The processing may be, for example, storage, classification and other offline operations. In addition, the high-value data acquiring unit 100 may also perform online streaming processing on the collected data.
As an example, the high-value data acquiring unit 100 may include a data conversion module such as a text analysis module, and correspondingly, at step S100, the high-value data acquiring unit 100 may convert unstructured data (such as a text) into structured data that is easy to use for subsequent processing or reference. Text-based data may include an email, a document, a web page, a graph, an electronic data table, a call center log, and a transaction report.
As an example, the exemplary embodiments of the present disclosure may use service data related to at least one of the following items: image recognition, e.g., optical character recognition (OCR), face recognition (security), object recognition (a traffic sign), and picture classification; speech recognition, e.g., natural language processing fused in a voice assistant; natural language processing, e.g., examining a text (e.g., a contract, a legal document, a customer service record), spam recognition, and text classification (emotion, intention, theme); automatic control, e.g., automatic control in energy industries (a mine, a wind turbine generator system) and energy saving (an air conditioning system); intelligent question answering, e.g., a chatbot or an intelligent customer service; operational decision-making (e.g., financial science and technology, including marketing and customer obtaining, anti-fraud, anti-money laundering, underwriting and credit scoring, etc.); medical treatment (e.g., disease screening and prevention, personalized health management, auxiliary diagnosis); municipal administration (e.g., social governance and supervision law enforcement, resource environment and facility management, industry development and economic analysis, public service and livelihood guarantee, smart city, etc.); and recommendation services (e.g., advertisement, consultation, music, videos, and financial products such as financing and insurance).
After the original data are acquired, the high-value data acquiring unit 100 may calculate the value of each piece of original data based on the utility function associated with the service revenue.
In an exemplary embodiment of the present disclosure, the high-value data acquiring unit 100 calculates a data value based on the utility function related to the service revenue. The high-value data acquiring unit 100 may construct a calculation method of a service revenue based on a period of service prediction, analyze each component in the service revenue, extract a portion related to data contribution, and construct the utility function for calculating the data value based on the extracted portion related to data contribution. As an example, the high-value data acquiring unit 100 may calculate a value of the original data based on a data Shapley value. Specifically, the high-value data acquiring unit 100 may calculate the data Shapley value of each piece of original data based on the utility function associated with the service revenue as the value of each piece of original data.
At step S200, pieces of high-value data are acquired from the original data based on the value of each piece of original data.
At step S300, the service prediction is performed based on the pieces of high-value data.
In an embodiment, the estimation unit 200 may perform a prediction based only on the high-value data screened out. As an example, a service model may be trained by the training unit 210 in the estimation unit 200 by using the acquired high-value data. The high-value data used may include only the high-value data acquired in the current period, may include all the high-value data acquired historically, or may include the high-value data screened in any one or more selected periods. As an example, the training unit 210 may train the service model by using any applicable machine learning algorithm. Then, the prediction unit 220 in the estimation unit 200 performs the service prediction by using the service model trained by the training unit 210. The prediction unit 220 may input each piece of data to be predicted into the service model, to acquire a prediction result corresponding to each piece of data to be predicted. Further, a corresponding service decision, for example a service resource allocation, may be made based on the prediction result.
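As a purely illustrative sketch of steps S200 and S300, the following Python fragment keeps the records with the highest values, fits a simple model on them, and predicts the next period. The function names, the `keep_fraction` parameter, the one-feature least-squares model and the sample numbers are hypothetical and are not part of the disclosure; any applicable machine learning algorithm may take the place of the linear fit.

```python
from statistics import mean


def select_high_value(records, values, keep_fraction=0.5):
    """Step S200 (sketch): keep the records whose value ranks in the top fraction."""
    if len(records) != len(values):
        raise ValueError("each record needs exactly one value")
    n_keep = max(1, round(len(records) * keep_fraction))
    ranked = sorted(range(len(records)), key=lambda i: values[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original record order
    return [records[i] for i in kept]


def fit_line(points):
    """Stand-in service model: least-squares fit y = a * x + b on (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, y_bar  # all x equal: fall back to predicting the mean
    a = sum((x - x_bar) * (y - y_bar) for x, y in points) / sxx
    return a, y_bar - a * x_bar


def predict(model, x):
    """Step S300 (sketch): apply the trained model to a new input."""
    a, b = model
    return a * x + b


# Hypothetical records: (thank-you notes, consultation orders) per doctor.
records = [(1, 2.0), (2, 4.0), (3, 6.0), (4, 30.0)]
# Made-up data values; the last record is assumed to be click-farming noise.
values = [0.4, 0.3, 0.5, -0.9]

high_value = select_high_value(records, values, keep_fraction=0.75)
model = fit_line(high_value)
next_period = predict(model, 5)
```

In this toy example the low-value record is dropped before training, so the fitted line follows only the three consistent records.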
Exemplary embodiments of the present disclosure are described below with an online medical platform as an example. However, it should be understood that the exemplary embodiments of the present disclosure are not limited to the online medical platform, but are applicable to any similar service prediction system, such as a prediction of airport passenger flow distribution, a music trend prediction, a demand prediction and storage planning solution, a prediction of Sina Weibo interaction volume, a prediction of monetary fund inflow and outflow, a movie box office prediction, a prediction of agricultural product price, a prediction of Tibetan Plateau lake area based on multi-source data, a prediction of microblog propagation scale and propagation depth, an abalone age prediction, a prediction of student performance ranking, a travel flow prediction of online car-hailing, a prediction of red wine quality score, a prediction of search amount and stock price fluctuation of a search engine, a prediction of rural resident income growth, a real estate sales influence factor analysis, a prediction of stock price trend, a prediction of national comprehensive transportation total amount and an earthquake prediction.
According to an exemplary embodiment of the present disclosure, a service model may relate to an online medical platform, and correspondingly, a service prediction is performed to predict the number of treatments that each doctor may undertake in a next period based on attribute information of each doctor on the online medical platform.
In the rapid development of digital economy, traditional medical resources are combined with a digital platform to generate an emerging medical mode—an online medical platform. Compared with a traditional hospital, the online platform eliminates a spatial distance between medical resources and patients, and provides more treatment options for the patients in an era of COVID-19. Moreover, the platform may fully integrate and utilize resources by using the platform data, which greatly improves the efficiency of allocating the medical resources. For example, the online medical platform accumulates a large amount of data during its operation, and improves the platform efficiency by making a personalized recommendation and demand matching based on data, to acquire more data, thereby constituting a “virtuous cycle” of data-operation.
A typical service mode of the platform is as follows: a patient searches the platform for a suitable doctor according to the patient's disease symptoms, and selects a preferred doctor for consultation based on the doctor static data in the doctor list returned by the platform. For example, when a patient searches for doctors recommended for “hypertension”, the patient selects an appropriate doctor for consultation based on doctor homepage information and data such as “doctor gender”, “age”, “gift” and “thank-you note”.
The online medical platform fuses data with the traditional medical industry, which injects new power into the traditional industry. However, platform data governance is also of increasing concern. On one hand, since big data describes medical behaviors on a platform more clearly, the platform achieves precise marketing with a finer granularity. However, the platform also needs to balance the data calculation cost against the effectiveness. On the other hand, there is also a “data click farming” phenomenon on the platform (such as malicious click farming and unreal evaluations), resulting in distortion of data. Therefore, it is very important for the platform to perform an appropriate value evaluation on data and screen out high-value data. In this way, the data calculation cost is reduced without reducing the algorithm performance, and a good platform ecology is constructed by filtering out unreal data.
Evaluating the data value remains a challenge in the related art. Determining a data value may also facilitate data fusion, which provides a basis for data governance. Current research on the data value is mainly conducted from the top-down perspective of economics, and the main measurement methods include a cost method, a market method and an income method. The three methods are constrained by data regulation, an incomplete data market and the difficulty of splitting the data value, resulting in a large measurement deviation of the data value.
According to an exemplary embodiment of the present disclosure, on an online medical platform, doctors are employed to complete online consultation services. It is very important for the platform to allocate doctor resources properly (for example, signing a contract that requires a longer online time with a doctor in high demand, and vice versa). Therefore, it is necessary to estimate the future demand for each doctor by using the data. For example, the total demand (e.g., the number of consultation orders) for a doctor within a period of time is predicted by machine learning based on static data accumulated over a period of time, such as the doctor's age, gender, hospital level, gifts, thank-you notes, patient registrations, waiting duration and patient votes. A decision (a doctor resource allocation) is made accordingly. However, since the amount of data used for prediction is large and noise data arise from behaviors such as click farming and fake comments, simply using the full data may both reduce the precision and increase the calculation cost. In addition, the black box model used by the platform is complex, and the platform is unable to interpret it effectively. As an example, in a machine learning process, the value of the data used may be evaluated to guide the data management of the platform, which reduces the calculation cost of the platform, improves the learning progress, and helps interpret the underlying logic of the black box model. With regard to interpreting the model, the high-value data are the data that contribute significantly to the prediction performance of the model, and they reflect the data distribution in a normal service mode. For example, a high consultation amount corresponds to a large number of thank-you notes. From the high-value data, it may be determined what data distribution helps the model obtain a good result.
For a data value evaluation framework that is updated online in an online medical platform illustrated in
A “prediction+operation” process of the platform is described. In order to allocate doctor resources reasonably, the platform predicts the demand for each doctor by using the doctor static data accumulated in a previous period, and the data used for prediction are illustrated below:
The platform may decide how to allocate the current resources based on the prediction result, so as to further accumulate data.
First, it is necessary to combine operation of a platform with a prediction behavior of machine learning to construct a corresponding utility function.
For example, for a medical platform, it is assumed that time is discrete and infinite, $t \in \{0, 1, \ldots\}$, that the average time for a doctor $i \in \{1, 2, \ldots, n\}$ to serve a patient is $\tau_i$ (with $\tau_i = \tau$), and that the revenue share acquired by the platform from one service is $r_i > 0$. The platform decides the cooperation agreement that a doctor signs for period $t+1$ based on the historical data in period $t$. The online service duration required in period $t$ is denoted as $S_{it}$, and the salary per unit duration is denoted as $w_i > 0$. The data collected by the platform during period $t$ is denoted as $X_t$, and the patient demand of the doctor is denoted as a random variable $D_{it}$. A doctor cannot serve patients whose requests exceed the working duration; such patients may be served by the platform at a cost $\alpha > 0$. In this case, the total revenue of the platform during period $t$ is expressed as:
It can be seen from the above equation that the more accurate the patient demand estimation of a single doctor by the platform, the more revenue the platform may obtain. Assuming that the platform estimates a next period demand of each doctor through machine learning based on the historical data, a cooperation agreement is established accordingly, and a working duration of the doctor in the next period is expressed as:
$$S_{it} = \tau\,\mathbb{E}\left[D_{it} \mid \mathcal{F}_{t-1}\right]$$
where $\mathcal{F}_{t-1} = \{X_k\}_{k=0}^{t-1}$ is the set of all historical data acquired before period $t$, which embodies a complete chain of “data→prediction→decision→value”.
Further, letting $\Delta D_{it} = \hat{D}_{it} - D_{it}$, where $\hat{D}_{it}$ denotes the predicted demand, the above equation (3.1) may be adjusted accordingly as follows.
When the demand estimation of the platform has a deviation, the allocation of resources may be temporarily adjusted to make up for the error of the demand prediction. When the demand prediction of the platform has a large deviation, an extra salary may be paid to the doctor, thereby reducing the platform revenue. When the predicted demand is insufficient, the platform needs to make a temporary, additional adjustment to the resource allocation. Assuming that the cost of adjusting the resource allocation cannot bring a positive revenue for the platform, the platform is forced to make a more accurate prediction in advance. In this case, it is obtained that $\forall i$, $w_i \tau + \alpha < r_i$, where $w_i > 0$.
It is noted that the revenue loss of the platform is closely related to $\Delta D_{it}$, that is, the absolute prediction error. Specifically, since the prediction error may lead to a revenue loss for the platform ($R_t^* - R_t \propto \mathrm{MAE}(D, \hat{D})$), the platform may define a utility function by using the mean absolute error (MAE) of the prediction. The utility function is shown as below:
That is, according to the exemplary embodiments of the present disclosure, the utility function is associated with a predicted error of a service model trained by using at least a portion of original data. The accuracy of platform data represented in machine learning is associated with the platform value by using the utility function.
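As a minimal sketch of an MAE-based utility of this kind, the following Python fragment scores a subset of training data by the negative mean absolute error of a model trained on that subset and evaluated on held-out data, so that a higher utility corresponds to a lower prediction error. The negative sign, the mean-demand stand-in model, the zero prediction for an empty subset and all names and numbers are assumptions made for illustration; the exact expression used in the disclosure may differ.

```python
def mean_absolute_error(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def fit_mean(train):
    """Hypothetical stand-in model: predict the mean demand seen in training."""
    return sum(y for _, y in train) / len(train)


def utility(train_subset, test_set):
    """Assumed utility: negative MAE on held-out data of a model trained on the subset."""
    if train_subset:
        level = fit_mean(train_subset)
    else:
        level = 0.0  # assumption: with no data, predict zero demand
    actual = [y for _, y in test_set]
    predicted = [level] * len(test_set)
    return -mean_absolute_error(actual, predicted)


# Made-up (feature, demand) pairs; the second training record mimics noise.
test_set = [(0, 10.0), (0, 12.0)]
clean = [(0, 11.0)]
noisy = [(0, 11.0), (0, 51.0)]

u_clean = utility(clean, test_set)  # error of 1 on each test point
u_noisy = utility(noisy, test_set)  # the noisy record shifts the prediction to 31
```

Under this definition, adding the noisy record lowers the utility, which is the behavior that lets a data value computed from the utility flag such records.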
In addition, optionally, the high-value data acquiring unit according to the exemplary embodiments of the present disclosure, may construct different utility functions in a similar way. For example, the utility function associated with the service revenue may be adjusted based on at least one of a service crowd, a service logic, an external environment and a time change.
Next, the data value may be calculated based on the utility function. According to the exemplary embodiments of the present disclosure, a data Shapley value of each piece of original data may be calculated based on the utility function associated with the service revenue as the value of each piece of original data.
In particular, the Shapley value was proposed by Lloyd Shapley to address the equitable distribution of cooperation revenues. The Shapley value, which comes from game theory, is determined based on the “marginal contribution” of a participant. In a cooperative game, the marginal contribution may be regarded as the influence on the cooperation revenues after a participant joins the game. The contribution of each participant may be calculated by traversing all combinations of participants, and a specific equation is as below:
where $K$ is the universal set, $n = |K|$, $z \in K$, the utility function $v$ represents the revenues of the cooperative game, and $\varphi_z(v)$ is the contribution of an element $z$ under the definition of the utility function $v$, that is, the Shapley value of the element $z$.
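For reference, the Shapley value is conventionally written as follows in the notation of this section. This is the standard textbook form and is given here only as an aid to the reader; the symbol $U$ for a subset of the universal set is borrowed from the later discussion, and the typesetting of the equation in the disclosure may differ.

```latex
\varphi_z(v) \;=\; \sum_{U \subseteq K \setminus \{z\}}
  \frac{|U|!\,\bigl(n - |U| - 1\bigr)!}{n!}\,
  \bigl[\, v\bigl(U \cup \{z\}\bigr) - v(U) \,\bigr]
```

Each term weights the marginal contribution of $z$ to a subset $U$ by the fraction of orderings of $K$ in which exactly the elements of $U$ precede $z$. The sum runs over $2^{n-1}$ subsets.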
According to an exemplary embodiment of the present disclosure, $z$ may be regarded as each data point or feature in the model, and $v$ may be considered as a loss function or accuracy (such as a utility function) that measures the model effect. In this case, $\varphi_z(v)$ is the contribution of the data point or feature $z$ to the loss function or the accuracy of the entire model. This value evaluation mode retains the completeness, fairness and additivity of the Shapley value, so that the value of a data point may be effectively extracted in a machine learning scenario.
It can be seen from the above equation that the calculation complexity of the Shapley value is exponential, and that calculating the “utility” of the model in each iteration means re-training the model by using a data subset or a feature subset $U$, which leads to a heavy calculation burden when the model is learned from a large amount of data.
For this purpose, optionally, the high-value data acquiring unit may perform a random permutation on the set of all the original data in each iteration of calculating the utility function. When the difference between the utility function value of the set consisting of any piece of original data and the elements preceding it, and the utility function value of the set of all the original data, is less than a preset threshold, the high-value data acquiring unit may keep the utility function value of the set consisting of the preceding elements unchanged for that piece of original data.
That is, the algorithm complexity is controlled by simplifying the data or feature permutations. For example, the Shapley value may be expressed in terms of data or feature permutations:
where $\pi \in \Pi(K)$ is a permutation of all elements in the universal set, and $P_i^{\pi}$ is the set consisting of the elements before the element $i$ in the permutation. In other words, $P_i^{\pi} = \{\pi[1], \ldots, \pi[j]\}$, where $\pi[j+1] = i$. First, a random permutation of the set of elements is performed. Second, when scanning from the first element to the last element, if the difference between the utility of the set consisting of a certain element and its preceding elements and the utility of the universal set is less than a certain threshold, the marginal contribution (utility) of that element to the set of its preceding elements is defined as 0. Finally, the Shapley value of the element is updated by using an average value. A pseudo code is as follows:
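The pseudo code itself is not reproduced in this text. The following Python sketch shows one way the three steps described above could be realized, under stated assumptions: the function name, its parameters, the running-average update and the toy utility are hypothetical, and the threshold test is applied to the utility of the preceding elements so that the utility need not be evaluated again once the threshold is met. The wording above could also be read as applying the test after the element is added.

```python
import random


def truncated_mc_shapley(n_points, utility, n_iterations=200, tolerance=1e-3, seed=0):
    """Estimate Shapley values by averaging marginal contributions over random
    permutations; a marginal is taken as 0 once the utility of the scanned prefix
    is within `tolerance` of the utility of the universal set."""
    rng = random.Random(seed)
    full_utility = utility(list(range(n_points)))
    shapley = [0.0] * n_points
    for t in range(1, n_iterations + 1):
        permutation = list(range(n_points))
        rng.shuffle(permutation)  # step 1: random permutation of all elements
        prefix = []
        prev_utility = utility(prefix)
        for i in permutation:  # step 2: scan from the first to the last element
            if abs(full_utility - prev_utility) < tolerance:
                marginal = 0.0  # truncated: no further utility evaluation
            else:
                prefix.append(i)
                new_utility = utility(prefix)
                marginal = new_utility - prev_utility
                prev_utility = new_utility
            # step 3: update the running average of the marginal contributions
            shapley[i] += (marginal - shapley[i]) / t
    return shapley


def saturating_utility(indices):
    """Toy utility (assumption): any two data points already reach full utility."""
    return float(min(len(indices), 2))


estimates = truncated_mc_shapley(4, saturating_utility, n_iterations=500)
```

In a real setting, `utility` would train and score a model on the indexed subset, so each skipped evaluation saves one re-training; the number of iterations and the tolerance trade accuracy against running time.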
According to the method, if the threshold condition is satisfied, a marginal utility of a data node is regarded as 0, and each permutation of the element set is random. In addition, the method uses a property of the Shapley value, that is, under a certain assumption, as a data volume or a feature volume increases, a utility of the data set may converge to the utility of the data or feature universal set, which may be expressed as
In other words, when a data volume is small, a contribution of the newly added data to the model is relatively high; however, when the data volume is large, a contribution of the newly added data to the model may decrease and be close to 0.
In fact, an approximate Shapley value may be obtained within a shorter time by controlling a number of iterations and a threshold of the algorithm. Compared with the existing method of calculating the Shapley value, the calculation efficiency of the disclosure is improved to a certain extent.
As another example, considering that most machine learning models behave continuously in a feature space, adjacent data points tend to have the same prediction results. Based on this property, the high-value data acquiring unit may calculate the data Shapley value of each piece of original data by: training a universal set service model based on the set of all the original data, recording a universal set prediction result of each piece of original data in the universal set service model, and in each iteration of calculating the utility function, acquiring a prediction result of the original data based on a universal set prediction result of at least one piece of original data close to the original data. In this way, re-training the model may be avoided.
As an example, the model is trained based on a complete training set, and a prediction of each data point may be recorded. Further, when prediction results in a prediction set need to be calculated in each iteration, the prediction results may be estimated by using the average of the K nearest predicted values in the training set. That is, for any data point (x, y)∈S′ in a test set S′, assuming that a K-nearest neighbor set of x in the training set is D_x, and a prediction model is f, then a predicted value y_pre for y is expressed as:
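The expression itself is not reproduced in this text. From the description above (the average of the K nearest recorded predictions), it would take the following form, where x′ is an assumed symbol for a neighbor of x in D_x:

```latex
y_{\mathrm{pre}} = \frac{1}{K} \sum_{x' \in D_x} f(x')
```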
Finally, the utility function v is calculated based on all the approximate prediction results in the prediction set, and there is no need to re-train the model, so that a calculation time may be greatly shortened.
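As a minimal sketch of this approximation, the following Python functions average the recorded predictions of the K nearest training points instead of re-training the model. Several points are assumptions made for illustration only: the neighbors are drawn from the subset currently being evaluated, the utility is taken as the negative mean absolute error on the test set (a placeholder for the utility function associated with the service revenue), and an empty subset falls back to a default prediction. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def knn_prediction(
    x: Point,
    train_x: Sequence[Point],
    cached_pred: Sequence[float],
    subset: Sequence[int],
    k: int = 3,
) -> float:
    """Average the recorded full-model predictions of the k training
    points in `subset` (given as indices) that are closest to x."""
    nearest = sorted(subset, key=lambda i: math.dist(x, train_x[i]))[:k]
    return sum(cached_pred[i] for i in nearest) / len(nearest)


def approximate_utility(
    subset: Sequence[int],
    train_x: Sequence[Point],
    cached_pred: Sequence[float],
    test_set: Sequence[Tuple[Point, float]],
    k: int = 3,
    default: float = 0.0,  # assumed fallback when the subset is empty
) -> float:
    """Negative mean absolute error on the test set, computed from
    neighbor averages only; the model is never re-trained here."""
    errors: List[float] = []
    for x, y in test_set:
        if subset:
            y_pre = knn_prediction(x, train_x, cached_pred, subset, k)
        else:
            y_pre = default
        errors.append(abs(y_pre - y))
    return -sum(errors) / len(errors)
```

Because the recorded predictions are computed once, each utility evaluation costs only a neighbor search over the subset.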
It should be understood that the foregoing two optimization processing manners may be used alone or in combination.
Assume that the online medical platform has about 20 million pieces of doctor data, and may acquire doctor data in the years of 2018, 2019 and 2020. In order to reduce the calculation complexity, 5000 pieces of doctor information and their corresponding consultation order information are extracted from the doctor data to conduct a verification study. A model for predicting demands of a doctor in the next year is constructed by utilizing doctor static information and platform behavior information, so as to study the value of the doctor data accordingly.
Based on data characteristics, the selected doctor data may be divided into the following two types. A first type is doctor registration information, which is information provided by a doctor when the doctor registers on an Internet medical platform, including the gender, title and hospital level of the doctor. This information reflects a consultation capability that the doctor statically provides. A second type is doctor platform behavior information, which is obtained based on consultation behavior statistics of the doctor on the platform, including “total number of consultation orders”, “Number of Articles”, “a registration number of post-diagnosis patients”, “patient voting”, “thank-you note”, “gift”, “general waiting duration” and “comprehensive recommendation popularity”.
The actual meaning of each data item is described in Table 1.
As an example, 5000 pieces of data are extracted, in which 4000 pieces are randomly extracted as training samples, and the remaining 1000 pieces are used as a test set. As illustrated in
The qualitative variables “gender”, “hospital level” and “general waiting duration” embody a doctor composition and a service comfort level of the platform. It can be seen from Table 2 that the proportions of male and female doctors are relatively stable across the two qualitative variables of hospital level and waiting duration; the doctors are concentrated in grade 3A hospitals, and the waiting duration of doctor consultation is mostly “zero”, which indicates that the doctor service efficiency on the platform is relatively high.
According to an exemplary embodiment of the present disclosure, a total number of orders for doctors in 2019 to 2020 may be predicted by using an XGBoost model based on doctor behavior information and registration information (the variables in Table 1 other than the total number of orders) at the beginning of 2019. According to the above preferred mode, values of different data sets or data points are estimated to address different data problems in the platform, which are described below in several parts.
In the field of big data, a larger data volume generally means a higher accuracy of machine learning. Therefore, accumulation of data is crucial to value accumulation of the platform. In an online platform, a cold start problem occurs: since little data has been accumulated for a platform participant, the platform has difficulty in understanding the data, which leads to a large error in an operation strategy. When the data volume is large enough, precise matching is relatively easy for the platform. However, since the online platform needs to quickly respond to a user request, the usable data volume may be restricted by computing power. Therefore, it is important for the platform to balance the values of different data sets against the calculation costs.
By using actual data of the online platform, sets with different data volumes are respectively taken, and Shapley values of the sets are calculated to measure values of data sets of different sizes and are compared with a classical information entropy (estimated by using the Kozachenko-Leonenko estimator). For the platform, the quality of a data set is directly related to the prediction error, measured by the mean absolute error (MAE). It can be seen that with the increase of the data set size, the prediction error gradually decreases, and the Shapley value gradually increases. That is, the value of the data increases. The Shapley value thus captures that the value increases as the data set size increases. An entropy is a function of a random variable, and is directly determined by its probability distribution. Therefore, entropies of data sets drawn from the same distribution should be consistent. However, in actual calculations, it is necessary to estimate the entropy by using a numerical method (the Kozachenko-Leonenko estimator), and the increase of the data volume makes the estimated distribution closer to the true distribution, so that the estimated entropy decreases and gradually converges. Therefore, the Shapley value more directly captures the value of the data volume.
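The entropy estimator is only named in the text. As a minimal sketch, assuming one-dimensional samples without duplicate values and the commonly used k-nearest-neighbor form of the Kozachenko-Leonenko estimator, the entropy of a sample may be estimated as follows. The function names and the choice k=3 are hypothetical, and the estimator actually used in the experiment may differ in dimension and in details.

```python
import math
from typing import Sequence

EULER_GAMMA = 0.5772156649015329


def digamma_int(n: int) -> float:
    """Digamma function at a positive integer n:
    psi(n) = -gamma + sum of 1/j for j = 1 .. n-1."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def kl_entropy_1d(samples: Sequence[float], k: int = 3) -> float:
    """k-nearest-neighbor entropy estimate (in nats) for 1-D samples.

    Uses H = psi(N) - psi(k) + log(2) + mean(log r_i), where r_i is the
    distance from sample i to its k-th nearest neighbor and 2 is the
    length of the one-dimensional unit ball. Assumes distinct samples.
    """
    xs = sorted(samples)
    n = len(xs)
    if n <= k:
        raise ValueError("need more than k samples")
    log_sum = 0.0
    for i, x in enumerate(xs):
        # In sorted 1-D data the k nearest neighbors lie within k
        # positions to the left or right of position i.
        window = range(max(0, i - k), min(n, i + k + 1))
        dists = sorted(abs(x - xs[j]) for j in window if j != i)
        log_sum += math.log(dists[k - 1])
    return digamma_int(n) - digamma_int(k) + math.log(2.0) + log_sum / n
```

Since the estimate depends only on the sampled distribution, it approaches a fixed value as more samples are added, rather than growing with the data volume.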
In the above section, the Shapley value successfully captures the value increase brought by the increase of the data volume, so that the prediction precision is improved. In this section, several different training sets are selected to compare a prediction error (MAE) with a platform total revenue (Rt) and obtain a relationship between them.
Based on the first two sections, the Shapley value successfully associates the platform revenue with the evaluated data value. In addition, it may be observed how each data point in the model affects the model, that is, how each data point affects the platform revenue, thereby interpreting the model and assisting in formulation of a platform operation strategy.
According to the defined utility function, a Shapley value of each data point in a training set may be calculated. As illustrated in Table 3, SV_quantile ranks the Shapley values from low to high. There are five groups divided based on each 20% quantile, in which 0%-20% represents the group with the lowest Shapley values, and 80%-100% represents the group with the highest Shapley values. The remaining variables in Table 3 represent the average of the corresponding values within each group.
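The grouping described above can be sketched in Python as follows. The function names and the record layout (a dictionary of per-doctor fields) are hypothetical and are used only for illustration.

```python
from typing import Dict, Hashable, List


def shapley_quantile_groups(
    values: Dict[Hashable, float], num_groups: int = 5
) -> List[List[Hashable]]:
    """Rank data points by Shapley value from low to high and split the
    ranking into `num_groups` equal quantile groups (20% each for 5)."""
    ranked = sorted(values, key=lambda key: values[key])
    n = len(ranked)
    groups: List[List[Hashable]] = [[] for _ in range(num_groups)]
    for rank, key in enumerate(ranked):
        groups[rank * num_groups // n].append(key)
    return groups


def group_average(
    groups: List[List[Hashable]],
    records: Dict[Hashable, Dict[str, float]],
    field: str,
) -> List[float]:
    """Average of one variable within each group; NaN for an empty group."""
    return [
        sum(records[key][field] for key in group) / len(group)
        if group else float("nan")
        for group in groups
    ]
```

The first returned group corresponds to the 0%-20% quantile and the last to the 80%-100% quantile.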
Firstly, two extreme groups of data are observed. Compared with other groups of data, the group (80%-100%) of data with the highest “value” is characterized by more gifts and thank-you notes and a higher total number of orders. That is, this conforms to a high correlation between any two variables as shown in
From the 5 groups of data as a whole, it may be seen that the two groups of data (20%-60%) are relatively close to each other, reflecting a basic situation of a large amount of data, and the groups of data (60%-100%) reflect a data situation in which there is a large number of orders. In combination with the above long-tailed data distribution, data points with a relatively high total number of orders are correctly predicted, which makes a relatively high contribution to the utility function, and the data value is high. In other words, the data points for which the platform correctly identifies a relatively high demand for the doctor in the actual service are of relatively high value.
In the above model calculation, data with a higher “value” may successfully reflect a positive correlation between each of the quantitative variables (such as a thank-you note and a gift) in the data distribution and an order number. Capturing the correlation helps the platform correctly establish a relationship between doctor data and a service capability of the doctor. However, low-value data deviates from the overall data distribution and interferes with the strategy formulated by the platform, which reduces the accuracy of the prediction model. Thus, the platform needs to delete the data that generates the interference, so as to ensure the maximum revenue of the platform strategy under the overall samples.
Malicious click farming and comment click farming on the online platform are common problems. Reputations of a doctor accumulated on the platform (such as a number of articles and gifts) are important factors that attract a patient, and such data are deliberately inflated by some doctors. Therefore, demand prediction is distorted, which makes platform operation difficult. The Shapley value may recognize the contribution of data in the model, and data from a user engaged in malicious click farming may have a negative effect on the model, which leads to a lower Shapley value. In this regard, the platform may appropriately delete some low-value data points to improve the quality of data used by the platform.
In the experiment of this section, the model effect of deleting low-value data is verified. Data points in the training set are sorted based on the Shapley value. Then, data points are removed in order from low to high, the model is re-trained each time one data point is removed, and a change of the model effect (MAE, the mean absolute error between the predicted value and the real value) is observed. The way of removing data points in the order of the Shapley value is compared with the way of removing data points in a random order, so as to reflect a correction effect of the Shapley value on model prediction. Experimental results are illustrated in
Thus, data points with low values affect the overall performance of a model, and it is highly likely that these data points correspond to some participants of data fraud in the platform. Therefore, it is meaningful that the platform screens out low-value data based on the value calculation of doctor demand data, which provides high-quality data learning samples for achieving accurate doctor recommendation and doctor-patient resource matching in the future.
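The removal experiment described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the experimental code: a trivial mean predictor stands in for the XGBoost model, and all function names are hypothetical. The order of removal is supplied by the caller, so that the same function may be run once with the indices sorted by Shapley value from low to high and once with a random order.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]            # (feature, target)
Predictor = Callable[[float], float]


def fit_mean(data: Sequence[Sample]) -> Predictor:
    """Toy stand-in for model training: always predicts the mean target."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def mean_absolute_error(predict: Predictor, test: Sequence[Sample]) -> float:
    return sum(abs(predict(x) - y) for x, y in test) / len(test)


def removal_curve(
    train: Sequence[Sample],
    test: Sequence[Sample],
    order: Sequence[int],
    fit: Callable[[Sequence[Sample]], Predictor] = fit_mean,
) -> List[float]:
    """Remove training points one at a time in `order` (indices into
    `train`), re-train after each removal and record the test MAE.
    The first entry is the MAE before any removal."""
    remaining = list(range(len(train)))
    curve = [mean_absolute_error(fit([train[i] for i in remaining]), test)]
    for idx in order:
        if len(remaining) <= 1:  # keep at least one training point
            break
        remaining.remove(idx)
        model = fit([train[i] for i in remaining])
        curve.append(mean_absolute_error(model, test))
    return curve
```

In a toy setting where one training point is an outlier, removing that point first lowers the MAE, whereas removing an ordinary point first raises it.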
In the present disclosure, in a resource allocation scenario of the online medical platform, the value of each data point used by the platform may be effectively measured, and a feasible solution is provided for the platform to improve the operation effect from data to model prediction to platform decision-making.
An apparatus and a method for data processing according to an exemplary embodiment of the present disclosure are described above with reference to
The computer program in the computer-readable medium may be running in an environment deployed in computer devices such as clients, hosts, agent devices, servers, etc. It should be noted that the computer program may also be configured to perform additional steps in addition to the above steps or perform more specific processing when performing the above steps, which are described referring to
It should be noted that the apparatus for data processing according to the exemplary embodiment of the present disclosure may fully rely on the running of the computer program to achieve a corresponding function. That is, each apparatus corresponds to each step in a functional architecture of the computer program, so that the entire system is called by a specialized software package (for example, a lib library) to achieve a corresponding function.
In addition, each apparatus illustrated in
For example, the exemplary embodiment of the present disclosure may also be implemented by a computing device. The computing device includes a storage component and a processor, and a set of computer executable instructions is stored in the storage component. When the set of computer-executable instructions is executed by the processor, a method for data processing according to the exemplary embodiment of the present disclosure is performed.
Specifically, the computing device may be deployed in a server or a client, or may be deployed on a node device in a distributed network environment. In addition, the computing device may be a personal computer (PC), a tablet device, a personal digital assistant, a smartphone, a web application, or other devices capable of executing the instruction set.
The computing device is not necessarily a single computing device, and may be an assembly of any apparatus or circuit capable of executing the instructions (or instruction sets) alone or in combination. The computing device may also be a part of an integrated control system or a system manager, or may be configured as a portable electronic device that is connected with local or remote devices via interfaces (e.g., via wireless transmission).
In the computing device, a processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller or a microprocessor. As an example, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
Some operations described in the method for data processing according to an exemplary embodiment of the present disclosure may be implemented by software, hardware, or a combination of software and hardware.
The processor may run instructions or codes stored in one of the storage components. The storage component may also store data. Instructions and data may also be sent and received over a network via a network interface device. The network interface device may employ any known transmission protocol.
The storage component may be integrated with the processor. For example, a RAM or a flash memory may be arranged within an integrated circuit microprocessor. In addition, the storage component may include an independent device, such as an external disk drive, a storage array, or other storage devices that may be used by any database system. The storage component and the processor may be coupled in operations, or may communicate with each other, such as through an I/O port, a network connection, so that the processor may read a file stored in the storage component.
In addition, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, a mouse and a touch input device). All components of the computing device may be connected to each other via a bus and/or a network.
The operations involved in the method for data processing according to the exemplary embodiment of the present disclosure may be described as various interconnected or coupled functional blocks or functional diagrams. However, the functional blocks or functional diagrams may be equally integrated into a single logic device or operated based on a non-exact boundary.
For example, as described above, the apparatus for data processing based on a data value according to the exemplary embodiment of the present disclosure may include a storage component and a processor. The storage component stores a set of computer executable instructions. When the set of computer executable instructions is executed by the processor, the following steps are performed: acquiring at least a portion of high-value data from original data; and performing a service prediction on the acquired high-value data. The value of each piece of original data is calculated based on a utility function associated with a service revenue, and the at least a portion of high-value data is acquired based on a calculation result.
Exemplary embodiments of the present disclosure are described above, and it should be understood that the above description is merely exemplary and is not intended to be exhaustive, and the disclosure is not limited to the disclosed exemplary embodiments. Various modifications and variations are apparent to those skilled in the art without departing from the scope and spirit of the disclosure. Therefore, the protection scope of the present disclosure should be subject to the scope of the claims.
Number | Date | Country | Kind |
---|---|---|---|
202210532391.6 | May 2022 | CN | national |