This application is the National Stage of International Application No. PCT/CN2013/075649, filed May 15, 2013, which claims the benefit of Chinese Patent Application No. 201210218551.6, filed Jun. 27, 2012, the disclosures of which are incorporated herein by reference in their entireties.
The present invention relates to the technical field of network information filtering, and especially to a system and a method for filtering keywords.
In the age of Web 2.0, Internet users create a very broad range of content. A large amount of text is generated on the Internet every day, such as posts on BBS (Bulletin Board System) forums, articles on blogs, and text on the rapidly growing micro-blog services. The text created by users covers almost everything. However, some of it involves eroticism, fraud, or politically sensitive information. Such content may degrade the online experience of readers or cause psychological or even economic harm. Therefore, each ICP (forum, blog, or micro-blog provider) urgently needs to filter user-created data effectively and promptly, thereby cleaning forum data and improving the user experience.
In the prior art, a common way to filter content containing sensitive information promptly is a scanning technique based on keyword content, that is, scanning for keywords related to the sensitive information. For example, keywords such as “eroticism gate”, “sex picture”, and “surreptitious photograph” may be scanned for in order to find a post related to “eroticism gate”. If any of these keywords is found while scanning the text of the post, the content is judged to contain sensitive information related to “eroticism gate”. In practice, however, some users deliberately make subtle modifications to the text they post in order to avoid censorship and filtering. Taking the keyword “eroticism gate” as an example, a user can change it in the text to be posted into variants such as “eroX gate”, “ero ◯ gate”, “ero tici sm gate”, “ero×ticism×gate”, “erox0tici0sm gate”, “ero*****ticism**************** gate”, etc. Although these variants do not affect a reader's understanding of the text, they are easily missed by prior-art techniques that scan for sensitive information by keyword. The eroticism, fraud, or politically sensitive information is then posted successfully, and the prior-art keyword-based scanning technique fails.
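The failure mode described above can be illustrated with a minimal Python sketch. The function name `naive_scan` and the sample sentences are hypothetical and are not taken from the specification; the sketch only assumes that prior-art scanning amounts to an exact substring test.

```python
def naive_scan(text, keywords):
    """Return the keywords that occur verbatim in the text."""
    return [keyword for keyword in keywords if keyword in text]


keywords = ["eroticism gate"]

# The unmodified phrase is found.
print(naive_scan("a post about eroticism gate", keywords))    # ['eroticism gate']

# Variants with inserted spaces or symbols are missed.
print(naive_scan("a post about ero tici sm gate", keywords))  # []
print(naive_scan("a post about ero×ticism×gate", keywords))   # []
```

An exact match has no tolerance for inserted characters, which is why the variants pass unnoticed.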
In view of the aforementioned problems, the present invention is proposed to provide a system and a method for filtering keywords that overcome the above problems, or at least partially solve or relieve them.
According to one aspect of the present invention, there is provided a system for filtering keywords, which comprises:
According to another aspect of the present invention, there is provided a method for filtering keywords, which comprises steps of:
According to another aspect of the present invention, there is provided a computer program which comprises computer readable code, wherein a server executes the method for filtering keywords according to any one of claims 9-16 when the computer readable code is run on the server.
According to another aspect of the present invention, there is provided a computer readable medium, which stores the computer program according to claim 17.
The beneficial effects of the present invention are:
The present invention may improve identification capability for sensitive information and improve filtering adaptability for the sensitive information by obtaining the character pitch between the keywords in the text content to be filtered and judging the character pitch.
The above description is merely an overview of the technical solution of the present invention. So that the technical solution can be understood more clearly and implemented in accordance with the contents of the specification, and so that the foregoing and other objects, features and advantages of the present invention become more apparent, detailed embodiments of the present invention are provided below.
Other benefits and advantages will become apparent to those skilled in the art upon reading the following detailed description of the preferred embodiments. The drawings are provided only for the purpose of illustrating the preferred embodiments and should not be considered as limiting the present invention. Throughout the drawings, the same component is indicated by the same reference number. In the drawings:
Hereafter, the present invention will be further described in connection with the drawings and the specific embodiments.
Preferably, the keywords would be words or single characters that constitute sensitive information. The preset keyword dictionary stores all the keywords that need to be filtered out.
The sensitive information may consist of multiple words. For example, the three words “America”, “bus” and “explosion” may carry no sensitive information when each appears on its own, but may constitute sensitive information when all three appear in the same section of text. Sensitive information formed by a plurality of words is generally discrete-type information with no ordering sequence. In this case the character pitch may be very large, and the sensitive information can still be conveyed even when the words are scattered throughout an article. In order to recognize such multi-word information, in this embodiment each of the words is regarded as a keyword. Assume that the keyword dictionary contains three keywords, “America”, “bus” and “explosion”, whose corresponding preset character pitch is 50, and that the scanning result (in the format “keyword”: position) is “bus”: 34, “America”: 48, “explosion”: 57.
The three words “America”, “bus” and “explosion” all appear in the scanning result, and the character pitch between any two of them is smaller than 50. Thus, the text content to be filtered is recognized as containing the sensitive information constituted by the three keywords, and it is either filtered or held for manual review.
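The unordered check in this example can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `contains_keyword_group` and the dictionary layout of the scanning result are hypothetical, and the character pitch between two keywords is assumed to be the difference between their reported positions, which the specification does not define here.

```python
from itertools import product


def contains_keyword_group(hits, keywords, max_pitch):
    """Return True if every keyword occurs and one occurrence of each
    can be chosen so that any two chosen positions are less than
    max_pitch apart. Order is ignored.

    hits maps each keyword to the list of positions where it was found.
    """
    position_lists = [hits.get(keyword, []) for keyword in keywords]
    if any(not positions for positions in position_lists):
        return False  # at least one keyword is missing from the text
    # All pairwise gaps are below max_pitch exactly when the spread
    # (largest minus smallest chosen position) is below max_pitch.
    return any(
        max(choice) - min(choice) < max_pitch
        for choice in product(*position_lists)
    )


# Scanning result from the example: "keyword": position
scan_result = {"bus": [34], "America": [48], "explosion": [57]}
flagged = contains_keyword_group(
    scan_result, ["America", "bus", "explosion"], max_pitch=50
)
print(flagged)  # True: the spread is 57 - 34 = 23, below 50
```

Trying every combination of occurrences is simple but grows quickly when a keyword occurs many times; a sliding window over the sorted positions would be a natural refinement.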
Preferably, if the keywords are single characters constituting sensitive information, with reference to
The keyword dictionary also stores a preset ordering sequence of the keywords.
Preferably, the ordering judgment module particularly comprises:
The sensitive information may be a phrase, for example, “eroticism gate”. When the sensitive information is a phrase, it is generally vector-type information with an ordering sequence: the keywords constituting it must appear in a particular order to convey the sensitive information. Thus, in order to recognize a modified phrase, this embodiment may divide the phrase into single characters and use each character as a keyword. Assume that the keyword dictionary contains three keywords, “erotic”, “ism” and “gate”, whose corresponding preset character pitch is 10, that the ordering sequence set in the keyword dictionary is “erotic”, “ism”, “gate”, and that the scanning result in the text content to be filtered (in the format “keyword”: position) is “ism”: 67, 77, “erotic”: 87, “gate”: 90.
The three keywords “erotic”, “ism” and “gate” all appear in the scanning result. However, they appear in the text content to be filtered in the sequence “ism” (67) -> “ism” (77) -> “erotic” (87) -> “gate” (90), given in the format “keyword” (position), which does not follow the preset sequence. Thus, the sensitive information “eroticism gate” is not identified in the text content to be filtered.
Moreover, a phrase may also be expressed with parts of it omitted or replaced, for example “erotiX gate” or “eroti◯ gate”, and still convey the sensitive information. A method with stronger recognition capability, but a relatively high misjudgment rate, can be used to identify such cases. The method is as follows. Assume that the keyword dictionary contains three keywords, “erotic”, “ism” and “gate”, whose corresponding preset character pitch is 10, that the ordering sequences in the keyword dictionary include (1) “erotic”, “ism”; (2) “erotic”, “gate”; and (3) “ism”, “gate”; and that the scanning result in the text content to be filtered (in the format “keyword”: position) is “ism”: 67, 77, “erotic”: 87, “gate”: 90.
The three keywords “erotic”, “ism” and “gate” all appear in the scanning result, in the sequence “ism” (67) -> “ism” (77) -> “erotic” (87) -> “gate” (90). Upon judging, ordering sequences (2) and (3) are satisfied, and the character pitch between “erotic” (87) and “gate” (90) is shorter than the preset character pitch. Thus, the sensitive information “eroticism gate” is identified in the text content to be filtered, and the text content is either filtered or held for manual review.
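A minimal Python sketch of this ordered check is given below. The function name `matches_ordered_sequence` and the data layout are hypothetical. The sketch further assumes that the preset character pitch applies to each pair of adjacent keywords in the sequence and that the pitch is the difference between positions; the specification does not state either point in this passage.

```python
def matches_ordered_sequence(hits, sequence, max_pitch):
    """Return True if one occurrence of each keyword can be chosen so
    that the occurrences follow the preset sequence and each adjacent
    pair is less than max_pitch apart.

    hits maps each keyword to the list of positions where it was found.
    """

    def extend(index, previous):
        if index == len(sequence):
            return True  # every keyword in the sequence has been placed
        for position in hits.get(sequence[index], []):
            in_order_and_close = (
                previous is None or 0 < position - previous < max_pitch
            )
            if in_order_and_close and extend(index + 1, position):
                return True
        return False

    return extend(0, None)


# Scanning result from the example: "keyword": positions
scan_result = {"ism": [67, 77], "erotic": [87], "gate": [90]}
identified = matches_ordered_sequence(
    scan_result, ["erotic", "ism", "gate"], max_pitch=10
)
print(identified)  # False: no "ism" occurs after "erotic" (87)
```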
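The relaxed, pairwise variant can be sketched in Python as follows. The function name `matching_ordered_pairs` is hypothetical, and the decision rule is an assumption read from the example: the text is flagged as soon as at least one preset ordered pair occurs in order with a position difference below the preset character pitch.

```python
def matching_ordered_pairs(hits, ordered_pairs, max_pitch):
    """Return the preset ordered pairs (first, second) for which some
    occurrence of `second` follows some occurrence of `first` by less
    than max_pitch.

    hits maps each keyword to the list of positions where it was found.
    """
    matches = []
    for first, second in ordered_pairs:
        in_order_and_close = any(
            0 < later - earlier < max_pitch
            for earlier in hits.get(first, [])
            for later in hits.get(second, [])
        )
        if in_order_and_close:
            matches.append((first, second))
    return matches


# Scanning result and ordered pairs from the example
scan_result = {"ism": [67, 77], "erotic": [87], "gate": [90]}
ordered_pairs = [("erotic", "ism"), ("erotic", "gate"), ("ism", "gate")]

matches = matching_ordered_pairs(scan_result, ordered_pairs, max_pitch=10)
print(matches)        # [('erotic', 'gate')]: 90 - 87 = 3, below 10
print(bool(matches))  # True: the text is flagged
```

In this sketch the pair “ism”, “gate” is in order but is not reported, because both gaps (23 and 13) exceed the preset pitch of 10. Accepting a single matching pair is what gives this variant its stronger recognition and its higher misjudgment rate.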
At Step S202, if the text content to be filtered contains no keywords stored in the preset keyword dictionary, the process may directly end;
At Step S203, if yes, the process may directly end.
Preferably, the keywords would be words constituting sensitive information and the preset keyword dictionary stores all the keywords that need to be filtered out.
Preferably, the keywords would be single characters constituting sensitive information and the preset keyword dictionary stores all the keywords that need to be filtered out.
With reference to
Preferably, the keyword dictionary also stores a preset ordering sequence of the keywords.
Preferably, when judging whether each keyword satisfies the ordering sequence, the method particularly comprises:
It should be noted that, in each component or element of the system according to the present invention, the components or elements are classified logically in terms of the function to be realized. Nevertheless, the present invention is not limited thereto and each component or element can be reclassified and reassembled as necessary. For example, some of components can be assembled into a single component or some of components can be disassembled into more subcomponents.
Each member embodiment of the present invention can be realized by hardware, by software modules running on one or more processors, or by a combination thereof. A person skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to realize some or all of the functions of some or all of the members of the system for filtering keywords according to the embodiments of the present invention. The present invention may further be realized as equipment or device programs (for example, computer programs and computer program products) for executing some or all of the methods described herein. Such a program for realizing the present invention may be stored in a computer readable medium, or may take the form of one or more signals. These signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other manner.
For example,
The terms “one embodiment”, “an embodiment” or “one or more embodiments” used herein mean that the particular feature, structure, or characteristic described in connection with the embodiments may be included in at least one embodiment of the present invention. In addition, it should be noted that, for example, the wording “in one embodiment” used herein does not necessarily always refer to the same embodiment.
A number of specific details have been described in the specification provided herein. However, it should be understood that the embodiments of the present invention may be implemented without these specific details. In some examples, known methods, structures and techniques are not shown in detail so as not to obscure the understanding of the specification.
It should be noted that the above-described embodiments are intended to illustrate but not to limit the present invention, and alternative embodiments can be devised by a person skilled in the art without departing from the scope of the appended claims. In the claims, any reference symbols between brackets do not limit the claims. The wording “comprising” is not meant to exclude the presence of elements or steps not listed in a claim. The wording “a” or “an” in front of an element is not meant to exclude the presence of a plurality of such elements. The present invention may be realized by means of hardware comprising a number of different components and by means of a suitably programmed computer. In a unit claim listing a plurality of devices, some of these devices may be embodied in the same hardware. The wordings “first”, “second”, “third”, etc. do not denote any order. These wordings can be interpreted as names.
Also, it should be noted that the language used in the present specification is chosen for the purpose of readability and teaching, rather than for the purpose of explaining or defining the subject matter of the present invention. Therefore, it is obvious to a person of ordinary skill in the art that modifications and variations could be made without departing from the scope and spirit of the appended claims. The disclosure of the present invention is illustrative but not restrictive, and the scope of the present invention is defined by the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2012 1 0218551 | Jun 2012 | CN | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2013/075649 | 5/15/2013 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2014/000519 | 1/3/2014 | WO | A |
Number | Date | Country | |
---|---|---|---|
20150339378 A1 | Nov 2015 | US |