The present disclosure belongs to the technical field of natural language processing in product comment analysis, and particularly relates to a method for analyzing fine-grained text sentiment based on the users' harshness.
Sentiment analysis is the process of analyzing, processing, summarizing and reasoning about subjective texts that carry sentiment. In recent years, with the rapid development of online commercial websites such as Taobao, JD.COM and Douban, sentiment analysis has attracted wide attention from both researchers and industry. Sentiment analysis of product comments can not only help consumers select useful information, but also help manufacturers improve their products.
Online product comments are mostly document-level and usually include evaluations of several product attributes. However, traditional sentiment analysis methods analyze a product comment as a whole, ignoring product attributes, and thus fail to evaluate products accurately. Therefore, fine-grained sentiment analysis of product comments is necessary. At present, researchers have done a great deal of work in this field, either by specifying product attributes in advance or by extracting product attributes from the comments. Wang Yequan et al. proposed ATAE, an LSTM-based model, in 2016, and Xue Wei et al. proposed GCAE, a CNN-based model, in 2018.
However, some problems remain in sentiment analysis, such as differences in users' harshness. That is, different users apply different evaluation criteria to the same product owing to the influence of culture, life experience and other factors. For example, some users tend to give good comments to all products, while others tend to give bad comments to all products. If this factor is not taken into account, it is difficult to obtain high-quality sentiment analysis results for product comments. Therefore, it is of great significance to study a sentiment analysis method that considers the users' harshness.
In view of the problems in the sentiment analysis of product comments that the analysis granularity is too coarse and that users differ in harshness, the present disclosure provides a method for analyzing fine-grained text sentiment based on the users' harshness, which further improves the accuracy of sentiment analysis of product comments.
The purpose of the present disclosure is realized through the following technical solution: a method for analyzing fine-grained text sentiment based on the users' harshness, which mainly consists of the following steps:
More specifically, the step (3) includes the following steps:
The step (3-1) could be further divided into the following steps:
The step (3-2) could be further divided into the following steps:
The step (3-3) could be further divided into the following steps:
Further, the calculation method in the step (d) of the step (3-3) is as follows: firstly, calculate the sum Aij of the L2 distances between a certain noun i and all attribute words in a certain product attribute j; then add the distances Aij of all nouns corresponding to the product attribute j to obtain a distance value Aj for each product attribute j; and finally take the product attribute with the smallest Aj as the product attribute corresponding to the sentence, as formalized below.
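For reference, this computation can be written compactly as follows, where v_i denotes the pre-trained Word2Vec vector of noun i, W_j the set of attribute words of product attribute j, and N the set of screened nouns in the sentence (this notation is introduced here for clarity and is not from the original text):

```latex
A_{ij} = \sum_{w \in W_j} \lVert v_i - v_w \rVert_2 , \qquad
A_{j}  = \sum_{i \in N} A_{ij} , \qquad
j^{*}  = \arg\min_{j} A_{j}
```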
The step (5) could be further divided into the following steps:
(a) Obtain the sentiment probability corresponding to a user id, a product id and an attribute statement vector of each comment; wherein the sentiment probability Pijt corresponding to each attribute statement is:
$P_{ij}^{t} = (P_{good}, P_{medium}, P_{poor})$
where i is the user id, j is the product id, t is the product attribute, and $P_{good} + P_{medium} + P_{poor} = 1$.
(b) Calculate a confusion matrix R, where R(xs, xy) indicates the probability that the attribute statement evaluation sijt is xs when the objective product attribute evaluation yjt is xy:
where xy, xs ∈ {good, medium, poor};
(c) Calculate the E step in the EM algorithm: calculate the conditional probability of the objective product attribute evaluation yjt:
where p(sijt|yjt, μi, τjt, R) represents the conditional probability that the attribute statement evaluation is sijt when the objective product attribute evaluation yjt, the users' harshness parameter μi, the product evaluation difficulty τjt and the confusion matrix R are given; p(yjt|S, μ, τ, R) indicates the conditional probability of the objective product attribute evaluation yjt when the attribute statement evaluation set S, the users' harshness parameters μ, the product evaluation difficulties τ and the confusion matrix R are given; and T is the number of possible results of a product evaluation;
(d) Calculate the M step in the EM algorithm: solve for the users' harshness μ and the product evaluation difficulty τ that maximize the Q function:
where Y is an objective product attribute evaluation set, and S is an attribute statement evaluation set;
(e) Optimize the Q function by the SLSQP (Sequential Least SQuares Programming) optimization algorithm, and stop when the maximum number of iterations is reached or when the difference between the Q functions of two adjacent iterations is less than a threshold value, so as to obtain the users' harshness μ, the product evaluation difficulty τ and the objective product attribute evaluation result Y.
Compared with previous methods, the present method has the following beneficial technical effects: the product attributes are automatically extracted from the comments in the data set, and the comments are segmented according to these product attributes, thereby achieving fine-grained analysis; then, an inference model that takes the users' harshness into account is adopted, so that the accuracy of sentiment analysis is further improved.
In order to make the above objects, features and advantages of the present disclosure more obvious and understandable, the specific embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings.
In the following description, many specific details are set forth in order to provide a full understanding of the present disclosure, but the present disclosure can also be implemented in ways different from those described here, and those skilled in the art can make similar generalizations without departing from the connotation of the present disclosure. Therefore, the present disclosure is not limited by the specific embodiments disclosed below.
As shown in the accompanying drawings, the method for analyzing fine-grained text sentiment based on the users' harshness proceeds as follows.
More specifically, the step (3) includes the following steps:
The step (3-1) could be further divided into the following steps:
Further, step (3-2) includes the following steps:
The step (3-3) could be divided into the following steps:
(a) Use the StanfordCoreNLP natural language processing toolkit to split the comment statements into sentences and mark the part of speech of each word;
(b) Extract all the nouns in a sentence and remove product names;
(c) Pre-train a TF-IDF model on the data set, and delete nouns whose TF-IDF values are too low or whose word frequencies are too high;
(d) Based on the pre-trained Word2Vec model, calculate the L2 distances between the screened nouns in the sentence and the attribute words of the four product attributes, and select the product attribute with the smallest distance sum as the product attribute corresponding to the sentence, as follows.
Firstly, calculate the sum Aij of the L2 distances between a certain noun i and all attribute words in a certain product attribute j; then add the distances Aij of all nouns corresponding to the product attribute j to obtain a distance value Aj for each product attribute j; finally, take the product attribute with the smallest Aj as the product attribute corresponding to the sentence.
(e) Merge the sentences that share a product attribute, so as to divide an overall comment statement into a vector of four attribute statements (a sketch of this pipeline follows).
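The following is a minimal sketch of steps (a) through (e), offered for illustration only: nltk stands in for the StanfordCoreNLP toolkit, gensim supplies the pre-trained Word2Vec vectors, and the attribute names, attribute words, file path and threshold are hypothetical placeholders rather than values from the original disclosure.

```python
import numpy as np
import nltk  # assumes the punkt and averaged_perceptron_tagger data are installed
from gensim.models import KeyedVectors

w2v = KeyedVectors.load("word2vec.kv")  # hypothetical path to pre-trained vectors
ATTRIBUTE_WORDS = {  # four illustrative product attributes and attribute words
    "plot": ["story", "plot", "script"],
    "acting": ["actor", "performance", "cast"],
    "visuals": ["effects", "cinematography", "scenery"],
    "music": ["soundtrack", "score", "music"],
}

def attribute_of_sentence(sentence, product_names, tfidf, min_tfidf=0.1):
    # (a) tokenize and part-of-speech tag; (b) keep nouns, drop product names
    tokens = nltk.word_tokenize(sentence)
    nouns = [w for w, tag in nltk.pos_tag(tokens)
             if tag.startswith("NN") and w.lower() not in product_names]
    # (c) screen out nouns whose pre-computed TF-IDF value is too low
    nouns = [w for w in nouns if tfidf.get(w, 0.0) >= min_tfidf and w in w2v]
    if not nouns:
        return None
    # (d) A_ij: sum of L2 distances between noun i and attribute j's words;
    # A_j sums A_ij over all screened nouns; pick the attribute minimizing A_j
    def a_j(words):
        return sum(np.linalg.norm(w2v[n] - w2v[w])
                   for n in nouns for w in words if w in w2v)
    return min(ATTRIBUTE_WORDS, key=lambda attr: a_j(ATTRIBUTE_WORDS[attr]))

def split_comment(comment, product_names, tfidf):
    # (e) merge sentences sharing a product attribute into one attribute statement
    grouped = {attr: [] for attr in ATTRIBUTE_WORDS}
    for sent in nltk.sent_tokenize(comment):
        attr = attribute_of_sentence(sent, product_names, tfidf)
        if attr is not None:
            grouped[attr].append(sent)
    return {attr: " ".join(sents) for attr, sents in grouped.items()}
```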
More specifically, step (5) includes the following steps:
(a) Obtain the sentiment probability corresponding to a user id, a product id and an attribute statement vector of each comment; wherein the sentiment probability Pijt corresponding to each attribute statement is:
$P_{ij}^{t} = (P_{good}, P_{medium}, P_{poor})$
where i is the user id, j is the product id, t is the product attribute, and $P_{good} + P_{medium} + P_{poor} = 1$.
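As a purely illustrative example (the ids, attribute names and probabilities below are invented here, not taken from the disclosure), the attribute statement evaluation set S gathered in this step can be thought of as a mapping from (user id i, product id j, product attribute t) to the probability vector above:

```python
# Hypothetical structure of the evaluation set S; each probability
# vector (P_good, P_medium, P_poor) sums to 1.
S = {
    ("user1", "movie1", "plot"):   (0.7, 0.2, 0.1),
    ("user2", "movie1", "plot"):   (0.1, 0.3, 0.6),
    ("user1", "movie1", "acting"): (0.4, 0.4, 0.2),
}
```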
(b) Calculate a confusion matrix R, where R(xs, xy) indicates the probability that the attribute statement evaluation sijt is xs when the objective product attribute evaluation yjt is xy:
where xy, xs ∈ {good, medium, poor};
(c) Calculate the E step in the EM algorithm: calculate the conditional probability of the objective product attribute evaluation yjt:
where p(sijt|yjt, μi, τjt, R) represents the conditional probability that the attribute statement evaluation is sijt when the objective product attribute evaluation yjt, the users' harshness parameter μi, the product evaluation difficulty τjt and the confusion matrix R are given; p(yjt|S, μ, τ, R) indicates the conditional probability of the objective product attribute evaluation yjt when the attribute statement evaluation set S, the users' harshness parameters μ, the product evaluation difficulties τ and the confusion matrix R are given; and T is the number of possible results of a product evaluation;
The method models the evaluation difficulty of a product attribute by the parameter 1/τjt ∈ [0, ∞), where τjt is constrained to be positive. 1/τjt = ∞ means that the product attribute is very difficult to evaluate, that is, the product attribute is similar to most other product attributes, or a great deal of background knowledge and relevant ability is needed to evaluate it; even a relatively professional expert would have only a 33.33% chance of evaluating it correctly. 1/τjt = 0 means that the product attribute is easy to evaluate. The larger 1/τjt is, the more difficult it is for users to evaluate the product attribute.
In addition, the harshness of each user is modeled by the parameter μi ∈ (−∞, +∞). μi = +∞ means that the user always gives the same evaluation as most people; μi = −∞ means that the user always gives an evaluation different from that of most people, i.e., his or her standard differs markedly from that of ordinary people; and μi = 0 means that the user knows nothing about the product, that is, his or her evaluation carries no information about the objective evaluation and is given at random. (A sketch consistent with these limiting behaviors is given below.)
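The disclosure's own formula images are not reproduced in this text, so the exact parameterization cannot be restated here. The limiting behaviors just described (chance level 1/3 when μi = 0 or 1/τjt = ∞, certainty 1 or 0 as μi → ±∞) are, however, consistent with a GLAD-style model (Whitehill et al., 2009); the sketch below shows one such assumed parameterization, not the disclosed formula.

```python
import numpy as np

K = 3  # three evaluation results: good / medium / poor

def p_correct(mu_i, tau_jt):
    # Assumed GLAD-style form for the probability that user i's evaluation
    # matches the objective evaluation y_jt: mu_i -> +inf gives 1,
    # mu_i -> -inf gives 0, and mu_i = 0 or tau_jt -> 0 (i.e. 1/tau_jt -> inf)
    # gives the chance level 1/K = 33.33%.
    return 1.0 / (1.0 + (K - 1) * np.exp(-mu_i * tau_jt))

def p_label(x_s, x_y, mu_i, tau_jt, R):
    # Assumed p(s_ijt = x_s | y_jt = x_y): the correct label with probability
    # p_correct; otherwise a wrong label, distributed according to the
    # confusion matrix R renormalized over the wrong labels.
    pc = p_correct(mu_i, tau_jt)
    if x_s == x_y:
        return pc
    wrong_mass = sum(R[x, x_y] for x in range(K) if x != x_y)
    return (1.0 - pc) * R[x_s, x_y] / wrong_mass
```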
(d) Calculate the M step in the EM algorithm: solve for the users' harshness μ and the product evaluation difficulty τ that maximize the Q function:
where Y is an objective product attribute evaluation set, and S is an attribute statement evaluation set;
(e) Optimize the Q function by the SLSQP (Sequential Least SQuares Programming) optimization algorithm, and stop when the maximum number of iterations is reached or when the difference between the Q functions of two adjacent iterations is less than a threshold value, so as to obtain the users' harshness μ, the product evaluation difficulty τ and the objective product attribute evaluation result Y. A condensed sketch of this EM procedure is given below.
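The following condensed sketch of steps (c) through (e) reuses the p_label function and K from the sketch above; scipy is assumed for the SLSQP optimizer, observations are re-indexed as integer user ids i and (product, attribute) ids jt, and the uniform prior, iteration cap and threshold are illustrative choices rather than values from the disclosure.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def neg_q(theta, S, post, R, n_users, n_items):
    # Negative Q function: expected complete-data log-likelihood under the
    # E-step posterior, minimized by SLSQP (so that Q is maximized).
    mu, tau = theta[:n_users], theta[n_users:]
    q = 0.0
    for i, jt, s in S:  # s is the soft label (P_good, P_medium, P_poor)
        for x_y in range(K):
            ll = sum(s[x_s] * np.log(p_label(x_s, x_y, mu[i], tau[jt], R) + 1e-12)
                     for x_s in range(K))
            q += post[jt, x_y] * ll
    return -q

def em(S, n_users, n_items, R, max_iter=50, tol=1e-4):
    mu, tau = np.zeros(n_users), np.ones(n_items)
    log_prior = np.log(np.full(K, 1.0 / K))  # uniform prior over evaluations
    prev_q = -np.inf
    for _ in range(max_iter):
        # E step: posterior p(y_jt | S, mu, tau, R) for every (product, attribute)
        logp = np.tile(log_prior, (n_items, 1))
        for i, jt, s in S:
            for x_y in range(K):
                logp[jt, x_y] += sum(
                    s[x_s] * np.log(p_label(x_s, x_y, mu[i], tau[jt], R) + 1e-12)
                    for x_s in range(K))
        post = np.exp(logp - logsumexp(logp, axis=1, keepdims=True))
        # M step: SLSQP over (mu, tau), with tau constrained to stay positive
        theta0 = np.concatenate([mu, tau])
        bounds = [(None, None)] * n_users + [(1e-6, None)] * n_items
        res = minimize(neg_q, theta0, args=(S, post, R, n_users, n_items),
                       method="SLSQP", bounds=bounds)
        mu, tau = res.x[:n_users], res.x[n_users:]
        if abs(-res.fun - prev_q) < tol:  # Q changed by less than the threshold
            break
        prev_q = -res.fun
    Y = post.argmax(axis=1)  # 0 = good, 1 = medium, 2 = poor
    return mu, tau, Y
```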
The applicant conducted an experiment on a movie data set collected from the Internet Movie Database (IMDb) online review website. Accuracy, F-measure, the Kappa coefficient and the Kendall correlation coefficient were used to measure the experimental results, which were compared with those of the ATAE and GCAE models. The final results are shown in Table 1 below.
The steps of the method or algorithm described in connection with the embodiments of the present disclosure may be implemented in hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules can be stored in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, hard disks, removable hard disks, CD-ROMs or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor, such that the processor can read information from, and write information to, the storage medium. The storage medium can also be an integral part of the processor. The processor and the storage medium may reside in an Application-Specific Integrated Circuit (ASIC). Alternatively, the ASIC may be located in a node device, such as the processing node described above. In addition, the processor and the storage medium may also exist in the node device as discrete components.
It should be noted that when the apparatus provided in the foregoing embodiments performs sentiment analysis, the division into the foregoing functional modules is used only as an example for description. In an actual application, the foregoing functions can be allocated to and implemented by different functional modules as required, that is, the inner structure of the apparatus is divided into different functional modules to implement all or some of the functions described above. For details about the specific implementation process, refer to the method embodiment; details are not described herein again.
All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used for implementation, all or some of the embodiments may be implemented in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a server or a terminal, all or some of the procedures or functions according to the embodiments of this application are produced. The computer instructions may be stored in a computer-readable storage medium, or may be transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a server or a terminal, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disk (DVD)), or a semiconductor medium (for example, a solid-state drive).
The above are only preferred embodiments of the present disclosure. Although the present disclosure has been disclosed through preferred embodiments, they are not intended to limit the present disclosure. Anyone familiar with the art can make many possible changes and modifications to the technical solution of the present disclosure, or modify it into equivalent embodiments, by using the methods and technical contents disclosed above, without departing from the scope of the technical solution of the present disclosure. Therefore, any simple modifications, equivalent changes and modifications made to the above embodiments according to the technical essence of the present disclosure, without departing from the content of the technical solution of the present disclosure, shall fall within the scope of protection of the technical solution of the present disclosure.
Number | Date | Country | Kind
---|---|---|---
202010504181.7 | Jun. 5, 2020 | CN | national
The present application is a continuation of International Application No. PCT/CN2020/127655, filed on Nov. 10, 2020, which claims priority to Chinese Application No. 202010504181.7 filed on Jun. 5, 2020, the contents of which are incorporated herein by reference in their entireties.
 | Number | Date | Country
---|---|---|---
Parent | PCT/CN2020/127655 | Nov. 10, 2020 | US
Child | 18074540 | | US