The present disclosure relates to techniques to detect and intervene against counterfeit products and/or fake assets online.
The sale of counterfeit products online costs enterprises (including companies and individuals alike engaged in commerce) billions of dollars annually. Products such as medicines, luxury goods, and hardware are heavily counterfeited online, making this an urgent topic for several industries. This problem damages enterprises at multiple levels. It not only impacts their sales and revenues but also affects their brand and customer relationships. In segments such as pharma or hardware, counterfeit products sold online can cause health and safety issues, with negative connotations for the brands. Enterprises are forced to expend considerable effort, time, and money to take down websites impersonating authentic enterprises selling counterfeit products. Conventional techniques to combat counterfeit products online heavily depend on manual intervention, which is slow and costly. This is due to the absence of truly automated solutions capable of searching, classifying and taking actions against counterfeit products and/or fake assets online.
A method is performed by a management system configured to communicate with one or more networks. The method includes performing a focused search to find uniform resource identifiers (URIs) of potentially counterfeit products and/or fake assets online based on brand assets terms associated with an enterprise, and performing a search for adjacencies of blacklist URIs of counterfeit products and/or fake assets online to find additional URIs of potentially counterfeit products and/or fake assets online that are related to the blacklist URIs. The method also includes adding the URIs and the additional URIs to a URI list, and classifying, by a machine learning classifier, each URI on the URI list as one of a blacklist URI of counterfeit products and/or fake assets online, and a whitelist URI of authentic products and/or assets online that are not counterfeit or fake. The method also includes repeating (i) the performing the search for adjacencies using blacklist URIs resulting from the classifying, (ii) the adding, and (iii) the classifying, to cause the URI list, a number of blacklist URIs, and a number of whitelist URIs to expand from the repeating, and removing access to counterfeit and/or fake assets online revealed by the focused search, the search for adjacencies, or the classifying.
The sale of counterfeit products online costs enterprises over $300 B annually. Products such as medicines, luxury goods, and hardware are heavily counterfeited online, making this an urgent topic for several industries. This problem damages enterprises at multiple levels. It not only impacts their sales and revenues but also affects their brand and customer relationships. In segments such as pharma or hardware, counterfeit products sold online can cause health and safety issues, with negative connotations for the brands. As an example, only during their last fiscal year, Novartis (pharma) took down more than 7,000 websites, including impersonated pharmacies as well as ecommerce sites selling counterfeit medicines. Conventional techniques to combat counterfeit products online heavily depend on manual intervention, which is slow and costly. This is due to the absence of truly automated solutions capable of searching, classifying and taking actions against counterfeit products and/or fake assets online.
These problems are becoming much bigger with the malicious use of artificial intelligence (AI) and automated attacks. Advances in Natural Language Processing (NLP) are particularly concerning, since they allow targeting consumers individually, by using personalized messages via social networks, search engines or emails. Advances on this front not only facilitate the creation of counterfeit offerings and fake assets online, but also allow fraudulent actions to scale out. Indeed, one of the reasons why conventional techniques to combat such counterfeiting are often ineffective and require manual intervention is because perpetrators rapidly move their counterfeit sites and/or products to a new online domain using automated mechanisms.
Conventional techniques lack mechanisms to correlate specific content and associate that content to the sale of the same counterfeit products online. Thus, if an enterprise is able to detect a counterfeit product online, instead of activating a procedure to take it down immediately, manual processes are triggered to gather information about the sellers and correlate them with other websites and offerings online, with the hope of preventing future frauds from the same sources. This process involves tedious, slow, and heavily manual investigations before any action is taken.
Solutions in this technology space can be grouped roughly into the following four categories:
Accordingly, embodiments presented herein include techniques to automatically find, classify, and take action against counterfeit products and/or fake assets online that overcome the above-mentioned problems and disadvantages associated with conventional techniques, and that offer advantages over the conventional techniques. To this end, the techniques presented herein combine machine learning and system wide network threat intelligence data to: automatically perform a focused search to find potential counterfeit products and/or fake assets online; automatically analyze and classify the findings of the focused search, and discover new adjacencies (i.e., new potential threats based on the findings); and automatically take actions to bring down or otherwise deny (i.e., remove or block) access to the counterfeit products and/or fake assets online. The term “online” is generally understood to mean accessible, or performed, over one or more communication networks, including, e.g., the Internet.
The techniques employ a focused search. The focused search finds potential counterfeits by starting a search from the communication channels usually used to attract potential consumers (e.g., spam email, Black hat search engine optimization techniques, targeted social media sites and messages, and the like). This can be enabled through the use of threat intelligence data (e.g., DNS related domain data and metadata, search engine optimization techniques and backlink exploration, and time and event correlation related to domains), which is collected from multiple sources and networks and aggregated at a system level (e.g., such as the data available through a threat intelligence platform, by Cisco). The search is expanded and complemented through the use of traditional search platforms (e.g., Google, Bing, and the like) and web crawling techniques.
Following the focused search, the techniques also analyze and classify findings of the focused search, and discover new adjacencies, which represent additional findings related to the findings of the focused search. To do this, the techniques gather or otherwise access network threat intelligence data and leverage that data both for steering the focused search as well as for discovering new adjacencies. Reciprocally, threat intelligence data stores can be expanded as a result of analysis and discovery of new webpages that represent adjacencies and new data specifically related to counterfeit products and/or fake assets online.
Referring first to
Clients 102 and servers 104 connect with each other over network 110, and then typical network traffic flows between the endpoints. For example, responsive to requests from clients 102, servers 104 may transfer data stored locally at the servers to the clients over network 110, as is known. Servers 104 typically transfer data to clients 102 as sequences of data packets, e.g., IP packets. Management system 108 may connect with clients 102 and/or servers 104 through network 110. Management system 108 may take on a variety of forms, including that of a client device, a server device, a network router, or a network switch configured with compute, storage, and network resources sufficient to implement the techniques presented herein.
Authentic enterprises (i.e., trusted, genuine enterprises that are not bad actors) may own and/or operate various ones of servers 104, and may host on those servers their authentic (i.e., genuine) products and/or assets online (also referred to as “online products and/or assets”) that are not counterfeit and/or fake. “Assets” may include websites, webpages, names of executives/managers of an enterprise (e.g., a fake John Doe facebook profile), look-alike website names (e.g., “authenticairjordan.com”), media files, marketing collaterals, software files/executables, and the like. Consumers may invoke standard browsers hosted on clients 102 to perform online searches for and to access the authentic products and/or assets online hosted on servers 104 over network 110. On the other hand, bad actors, such as counterfeiters, may orchestrate deployment of counterfeit products and/or fake assets online on other ones of servers 104. Consumers may also access the counterfeit products and/or fake assets online over network 110 via clients 102. Often, the consumers cannot differentiate between the authentic products and/or assets online and the counterfeit products and/or fake assets online. Management system 108 primarily interacts with repository 106 and network 110 to automatically find, classify, and take corrective actions against the counterfeit products and/or fake assets online that are hosted on the above-mentioned servers, for example, as described above and further below.
With reference to
At 202, information to be used in a focused search (described below) is collected. The information includes brand asset terms relevant to the enterprise. The brand asset terms may be collected in an automated manner, e.g., by integrating with software products for Enterprise Resource Planning (ERP)/Customer Relationship Management (CRM), and scraping authentic domains. Also, the brand asset terms may be collected and entered manually, e.g., using a graphical user interface (GUI) by which an administrator of management system 108 may enter the brand asset terms. Examples of brand asset terms include product names (e.g., Cisco Meraki), brand names (e.g., Ralph Lauren, which is a brand name relevant to the L'Oreal group), and domain names (e.g., https://www.ralphlauren.es/en). The information collected at 202 also includes the set of keywords and the relevant dictionaries from keyword store KS. The information collected at 202 may be formatted as an information matrix (e.g., a one-dimensional, two-dimensional, or three-dimensional matrix) for presentation/input to next operation 204.
At 204, a focused search on the information collected at 202, i.e., the brand asset terms relevant to the enterprise, the set of keywords, and the relevant dictionaries, is performed. In an example, the focused search may include a search on/over a network, i.e., the focused search may include an online focused search. The focused search finds/detects uniform resource identifiers (URIs) of potentially counterfeit and/or fake assets online. As used herein, the term URI refers to, for example, an identifier of a resource accessible over a network. Examples of URIs include, but are not limited to, domain names, uniform resource locators (URLs), webpages served on the World Wide Web (WWW), or product pages of specific e-commerce platforms, such as Amazon.com. The URIs found by the focused search are stored/added in/to a URI list 206 for subsequent processing. At a high-level, the focused search searches through mechanisms and channels commonly used between consumers and counterfeiters to attract the consumers online, which mechanisms and channels may be accessible in network threat data 214, for example. Examples include data from spam email, search engine results generated using Black hat search engine optimization techniques, targeted social media posts, marketplace search application programming interfaces (APIs), and so on. Thus, the focused search is performed not only on traditional search engines such as Google or Bing, but also using network threat data 214 (e.g., which stores spam email lists or phishing website lists). The focused search may use any of a plurality of methods including indexing spam email, social media search, and so on.
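By way of a non-limiting illustration, one way the focused search at 204 may combine the collected brand asset terms with the keywords is to generate query strings for the traditional search platforms. The function and term lists below are illustrative only and not part of any specific implementation:

```python
from itertools import product

def build_focused_queries(brand_terms, keywords):
    # Pair every brand asset term with every keyword to form
    # query strings suitable for traditional search platforms.
    return [f'"{term}" {kw}' for term, kw in product(brand_terms, keywords)]

# Brand asset terms as collected at 202, plus illustrative keywords.
queries = build_focused_queries(
    ["Cisco Meraki", "Ralph Lauren"],
    ["cheap", "replica", "outlet"],
)
```

Each resulting query string may then be submitted to one or more search engines, with the returned URIs added to URI list 206.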
More specifically, the focused search may include (i) search engine optimization, (ii) social media Bot detection, and (iii) an NLP-derived dictionary, each described below. Other channels, such as spam emails or phishing website lists, can be combined with techniques (i), (ii), and (iii). A Bot (e.g., an “Internet Bot” or “web robot”) is an autonomous software application that runs automated tasks over a network, e.g., the Internet, as is known.
An example of the search engine optimization may include the following sequence of operations:
An example of social media Bot detection may include the following sequence of operations:
e. Recursively crawl links from (d) (e.g., by following web-redirection URLs, links embedded in pages, and so on) to find additional URLs.
An example of the NLP-derived dictionary includes applying NLP-based tokenization of an enterprise's digital assets (e.g., finding common nouns and proper nouns within the authentic marketing website) to create a dictionary of keywords that are used as seeds for the above-described social media Bot detection.
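A minimal sketch of the NLP-derived dictionary follows. It uses simple frequency-based token extraction as a stand-in for full noun extraction; the stopword list, function name, and sample text are illustrative assumptions:

```python
import re
from collections import Counter

def derive_keyword_dictionary(page_text, top_n=3):
    # Crude stand-in for NLP noun extraction: tokenize the
    # enterprise's marketing text, drop short words and stopwords,
    # and keep the most frequent remaining tokens as seed keywords.
    stopwords = {"the", "and", "for", "with", "our", "are", "from", "that"}
    tokens = re.findall(r"[a-z]+", page_text.lower())
    counts = Counter(t for t in tokens if len(t) > 3 and t not in stopwords)
    return [word for word, _ in counts.most_common(top_n)]

seed_keywords = derive_keyword_dictionary(
    "Our sneakers are genuine sneakers. Buy genuine sneakers "
    "from the official store."
)
```

The resulting seed keywords may then feed the social media Bot detection described above.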
At 208, URI exploration is performed. To do this, the URIs in URI list 206 identified by the focused search are explored and data is collected from the URIs (e.g., by web crawling the URIs using web crawlers or through the APIs provided by marketplaces). The exploration of the URIs may lead to additional URIs of potentially counterfeit and/or fake assets that need to be explored further. The additional URIs are also stored in URI list 206 to expand the URI list. Raw data collected from URI exploration, including the additional URIs, as well as other data, such as metadata, indexes, and so on, is stored in a raw data database 210, for subsequent processing.
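The recursive exploration at 208 may be sketched as a breadth-first traversal. In the illustrative code below, the `link_graph` mapping stands in for live web crawling or marketplace API calls, and the depth limit is an assumed tuning parameter:

```python
from collections import deque

def explore_uris(seed_uris, link_graph, max_depth=2):
    # Breadth-first exploration of the URIs on URI list 206;
    # link_graph maps a URI to the URIs found on its page.
    seen = set(seed_uris)
    queue = deque((uri, 0) for uri in seed_uris)
    while queue:
        uri, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for linked in link_graph.get(uri, ()):
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, depth + 1))
    return seen

# Illustrative link structure discovered during crawling.
links = {
    "shop.example": ["deals.example", "pay.example"],
    "deals.example": ["mirror.example"],
    "mirror.example": ["deep.example"],
}
found = explore_uris(["shop.example"], links, max_depth=2)
```

URIs beyond the depth limit (here, "deep.example") would be picked up on a later iteration once their parents are themselves seeds.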
At 212, the raw data stored in raw data database 210 is complemented with additional data produced by network threat analysis. The network threat analysis operates on network threat data 214. For example, the network threat analysis produces data that includes DNS registration/query information, matching of URIs in spam emails, and so on, that enhances the raw data from 208.
At 216, an ML classifier classifies each URI on URI list 206, complemented with the raw data of raw database 210, as one of: (i) a URI of counterfeit products and/or fake assets online, referred to as a blacklist URI; (ii) a URI of authentic products and/or assets online that are not counterfeit or fake, referred to as a whitelist URI; and (iii) a URI of products and/or assets online that the ML classifier is unable to classify as either a blacklist URI or a whitelist URI due to lack of classifying information/knowledge with respect to the URI, referred to as an “undetermined URI” or a URI having an unknown status, for which further analysis is needed. An example ML classifier is described below in connection with
It is assumed that the ML classifier is trained initially in an a priori training stage, i.e., prior to operation 216. In the a priori training stage, training data is supplied to inputs of the ML classifier to train the ML classifier. The training data typically includes labels derived from URIs of authentic products and/or assets online associated with respective indicators/tags to indicate authenticity, and labels derived from URIs of counterfeit products and/or assets online associated with indicators/tags to indicate counterfeit or fake status. After the a priori training stage, during runtime/real-time classifying at 216, the ML classifier employs supervised learning using feedback of the whitelist URIs that result from classifying at 216 as training labels.
Examples of ML classifiers that may be trained for use in classifying at 216 include individual URI classifiers and network clustering classifiers. Individual URI classifiers take as input a single URI and produce an output that can be interpreted as a probability that the URI aims to sell a counterfeit product and/or fake asset. The URIs provided as input to these classifiers are provided as output from the focused search at 204, a search for adjacencies (operation 226, described below), and the recursive web crawling at 208. Features that may be extracted from the input URIs, and that form the basis for classifying the URIs, include items such as images on a webpage indicated by the URI, price lists, DNS data for webpages including authorization and query logs, and so on. These classifiers may use any known or hereafter developed ML algorithm to classify the URIs, including logistic regression, support vector machines, decision trees, and so on.
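As a non-limiting sketch of the logistic-regression option, the scorer below maps a URI feature vector to a counterfeit probability. The feature names, weights, and bias are purely illustrative, as if learned offline:

```python
import math

def uri_counterfeit_probability(features, weights, bias):
    # Standard logistic-regression scoring: a higher linear score z
    # maps to a higher probability that the URI is counterfeit.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features: [price discount ratio, spam-email hits,
# domain age in years]; weights/bias are illustrative only.
weights, bias = [2.0, 1.5, -0.8], -1.0
suspect = uri_counterfeit_probability([0.9, 3.0, 0.1], weights, bias)
benign = uri_counterfeit_probability([0.1, 0.0, 8.0], weights, bias)
```

A deeply discounted, heavily spammed, newly registered page scores high; an old domain with no spam hits scores low.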
On the other hand, network clustering classifiers take as input a set of (potentially suspicious) URIs and output an indicator that the URIs belong to one or more clusters operated by the same counterfeiter (e.g., given a set of social media Bots, the classification can determine whether a subset of these Bots belongs to the same agent). The inputs to these classifiers include the output of the individual URI classifier described above, and the search for adjacencies mentioned above. Examples of such classifiers include nearest neighbor classifiers (e.g., that use locality-sensitive hashing), unsupervised clustering (e.g., K-means), maximal subsequence matching algorithms, and so on.
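A minimal illustration of the clustering idea follows, using greedy single-link clustering over Jaccard similarity of page-token fingerprints as a simplified stand-in for locality-sensitive hashing or K-means. The URIs, fingerprints, and threshold are hypothetical:

```python
def cluster_uris_by_fingerprint(fingerprints, threshold=0.5):
    # Union-find based single-link clustering: URIs whose token-set
    # fingerprints exceed a Jaccard-similarity threshold are grouped
    # as likely operated by the same counterfeiter.
    uris = list(fingerprints)
    parent = {u: u for u in uris}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for i, a in enumerate(uris):
        for b in uris[i + 1:]:
            fa, fb = fingerprints[a], fingerprints[b]
            if len(fa & fb) / len(fa | fb) >= threshold:
                parent[find(b)] = find(a)

    clusters = {}
    for u in uris:
        clusters.setdefault(find(u), set()).add(u)
    return list(clusters.values())

clusters = cluster_uris_by_fingerprint({
    "botA.example": {"nike", "air", "cheap", "paypal"},
    "botB.example": {"nike", "air", "cheap", "visa"},
    "other.example": {"rolex", "replica"},
})
```

Here the two Bots sharing most of their fingerprint end up in one cluster, attributable to a single agent.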
At 218, it is determined whether the ML classifier classified the URI presented at the input to the classifier as a blacklist URI, a whitelist URI, or an undetermined URI. When/if it is determined that the URI is a blacklist URI, a whitelist URI, or an undetermined URI, the URI is stored in a blacklist 220 of URIs classified as blacklist URIs, a whitelist 222 of URIs classified as whitelist URIs, or an undetermined/unknown list 224 of URIs which the classifier was unable to classify as either a blacklist URI or a whitelist URI, respectively, for subsequent processing.
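The three-way routing at 218 may be reduced to simple thresholding of the classifier's output probability, as in the following illustrative sketch (the threshold values are assumptions, not prescribed by the disclosure):

```python
def route_uri(probability, black_threshold=0.8, white_threshold=0.2):
    # Map the classifier's counterfeit probability onto one of the
    # three lists maintained at 218; thresholds are illustrative.
    if probability >= black_threshold:
        return "blacklist"
    if probability <= white_threshold:
        return "whitelist"
    return "undetermined"
```

URIs routed to "undetermined" remain candidates for further analysis via the search for adjacencies.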
The output of the ML classifier, i.e., classification decisions, is used to take one or more of four possible actions:
At 226, the above-mentioned search for adjacencies is performed. In an example, the search for adjacencies may include a search that is performed on/over a network, i.e., the search for adjacencies may include an online search for adjacencies. The search for adjacencies searches the blacklist URIs and the undetermined URIs to find additional URIs of potentially counterfeit products and/or fake assets online that are related to the blacklist URIs and the undetermined URIs. The search for adjacencies also finds network relationships between multiple nodes that belong to the same counterfeiter, so that this found “network of the counterfeiter's URIs” may be blocked/stopped instead of having to block many counterfeiting URIs individually. The additional URIs (i.e., adjacencies) found by the search for adjacencies are added to URI list 206 to expand the URIs that are to be subjected to the (ML) classifying at 216.
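One round of the adjacency expansion at 226 may be sketched as follows. The `find_adjacent` callable stands in for the backlink and DNS lookups described below; all names and sample URIs are illustrative:

```python
def expand_with_adjacencies(uri_list, suspect_uris, find_adjacent):
    # Query adjacencies of each blacklist/undetermined URI and add
    # any URIs not already on URI list 206 to the list.
    additions = set()
    for uri in suspect_uris:
        for adjacent in find_adjacent(uri):
            if adjacent not in uri_list:
                additions.add(adjacent)
    uri_list |= additions
    return additions

# Illustrative adjacency data (e.g., from backlink exploration).
adjacency_data = {"fakestore.example": ["fakestore2.example", "pay.example"]}
uri_list = {"fakestore.example", "pay.example"}
new_uris = expand_with_adjacencies(
    uri_list, {"fakestore.example"}, lambda u: adjacency_data.get(u, [])
)
```

Only genuinely new URIs are added, so repeated rounds converge once a counterfeiter's network has been fully traversed.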
As shown in
The search for adjacencies may include searches for backlink-based adjacencies and DNS based adjacencies. A search for backlink-based adjacencies includes the following operations:
A search for DNS-based adjacencies includes the following operations:
At 228, automated intervention is performed based on the blacklist URIs accessed in blacklist 220. Such intervention removes or blocks online access to the blacklist URIs, which prevents the sale of counterfeit products and fake advertising, for example. Several possible methods of intervention may be used, including domain takedowns, payment notifications, DNS and web filtering, and so on. Also, a message may be sent to an administrator of management system 108 indicating the blacklist URIs, for display on a GUI at the management system. Successful intervention makes blacklist URIs no longer accessible online at a later stage in time, which also changes the results of the focused search, e.g., the blacklist URIs subject to successful intervention may be removed from URI list 206 (if a search of the URI list finds such blacklist URIs in the URI list) (indicated in
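As a sketch of the DNS/web filtering option at 228, the helper below reduces blacklist URIs to the unique domains that could be fed to a DNS filter. The function name and sample URIs are illustrative:

```python
def build_dns_blocklist(blacklist_uris):
    # Extract the unique, lower-cased domains from blacklist URIs,
    # suitable for loading into a DNS or web filter.
    domains = set()
    for uri in blacklist_uris:
        without_scheme = uri.split("://", 1)[-1]
        domains.add(without_scheme.split("/", 1)[0].lower())
    return sorted(domains)

blocklist = build_dns_blocklist([
    "https://Fake-Meraki.example/shop",
    "http://fake-meraki.example/cart",
    "replicas.example/watch1",
])
```

Blocking at the domain level denies access to every page a counterfeiter hosts there, rather than blocking individual URIs.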
Method 200 includes a second automatically repeating loop, including the following operations: classifying 216; feedback of blacklist URIs to network threat data 214; and network threat analysis 212 feeding data to raw data database 210 to complement the URIs from URI list 206 then presented to classifying 216. The second automatically repeating loop expands and refines network threat data 214 and network threat analysis 212, to produce more accurate information for the classifying. The automatically repeating loop that incorporates the search for adjacencies 226 and the second automatically repeating loop together represent interrelated automatically repeating loops that enable two different counterfeit and threat analyses to work in concert, to improve automatically finding, classifying, and taking corrective action against counterfeit products and/or fake assets online. Thus, method 200 represents a closed-loop method that incorporates multiple repeating loops that improve outcomes.
With reference to
At 302, management system (MS) 108 automatically collects brand assets terms associated with an enterprise, and performs a focused search to find URIs of potentially counterfeit products and/or fake assets online based on the brand assets terms.
At 304, MS 108 performs a search for adjacencies (also referred to as an “adjacencies search”) of blacklist URIs of counterfeit products and/or fake assets online to find additional URIs of potentially counterfeit products and/or fake assets online that are related to the blacklist URIs.
At 306, MS 108 performs online web crawling of each URI found by the focused search and the search for adjacencies to find URIs of potentially counterfeit products and/or fake assets online.
At 308, MS 108 adds to a URI list of URIs (of potentially counterfeit and/or fake assets online) the URIs found by the focused search, the search for adjacencies, and the online web crawling.
At 310, MS 108 classifies, by a machine learning classifier, each URI on the URI list as one of a blacklist URI of counterfeit products and/or fake assets online, a whitelist URI of authentic products and/or assets online that are not counterfeit or fake, and an undetermined URI when the URI cannot be classified as either a blacklist URI or a whitelist URI. The whitelist URI is fed-back to a supervised training input of the ML classifier, to train the ML classifier.
At 312, MS 108 automatically repeats operations: (i) the performing the search for adjacencies using blacklist URIs resulting from the classifying; (ii) the adding; (iii) the online web crawling; and (iv) the classifying; to cause the URI list, a number of blacklist URIs, and a number of whitelist URIs to expand as a result of successive iterations of the repeating.
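The repetition at 312 may be sketched end-to-end as the following simplified closed loop, in which `classify` and `find_adjacent` are stand-ins for the ML classifier and the adjacency search (only two classes are shown, for brevity, and all sample URIs are hypothetical):

```python
def run_detection_loop(uri_list, classify, find_adjacent, max_iters=10):
    # Classify every URI, search adjacencies of blacklist URIs, and
    # grow the URI list until no new URIs are found.
    blacklist, whitelist = set(), set()
    for _ in range(max_iters):
        for uri in uri_list:
            (blacklist if classify(uri) == "blacklist" else whitelist).add(uri)
        new_uris = {a for b in blacklist for a in find_adjacent(b)} - uri_list
        if not new_uris:
            break
        uri_list |= new_uris
    return blacklist, whitelist

# Illustrative adjacency chain operated by a single counterfeiter.
adjacency = {"fake1.example": ["fake2.example"], "fake2.example": ["fake3.example"]}
blacklist, whitelist = run_detection_loop(
    {"fake1.example", "real.example"},
    classify=lambda u: "blacklist" if u.startswith("fake") else "whitelist",
    find_adjacent=lambda u: adjacency.get(u, []),
)
```

Each iteration surfaces URIs one adjacency hop further out, so the blacklist grows until the counterfeiter's network is exhausted.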
At 314, MS 108 automatically removes access to counterfeit and/or fake assets online found by the focused search, found by the search for adjacencies, or identified by the classifying (i.e., revealed by the aforementioned operations). For example, MS 108 automatically intervenes against the blacklist URIs to remove network access to the blacklist URIs.
MS 108 repeats the operations of method 300.
With reference to
The processor(s) 410 may be a microprocessor or microcontroller (or multiple instances of such components). The NIU 412 enables management system 108 to communicate over wired connections or wirelessly with a network. NIU 412 may include, for example, an Ethernet card or other interface device having a connection port, which enables management system 108 to communicate over the network via the connection port. In a wireless embodiment, NIU 412 includes a wireless transceiver and an antenna to transmit and receive wireless communication signals to and from the network.
The memory 414 may include read only memory (ROM), random access memory (RAM), magnetic disk storage media devices, optical storage media devices, flash memory devices, electrical, optical, or other physically tangible (i.e., non-transitory) memory storage devices. Thus, in general, the memory 414 may comprise one or more tangible (non-transitory) computer readable storage medium/media (e.g., memory device(s)) encoded with software or firmware that comprises computer executable instructions. For example, control software 416 includes logic to implement methods/operations relative to management system 108, and logic to implement an ML classifier as described herein, such as methods 200 and 300. Thus, control software 416 implements the various methods/operations described above. Control software 416 also includes logic to implement/generate for display GUIs in connection with the above described methods/operations. Memory 414 also stores data 418 generated and used by control software 416, such as blacklists, whitelists, a URI list, keywords and dictionaries, raw data, network threat data, and GUI information as described herein.
A user, such as a network administrator, may interact with management system 108, to display indications and receive input, and so on, through GUIs by way of a user device 420 (also referred to as a “network administration device”) that connects by way of a network with management system 108. The user device 420 may be a personal computer (laptop, desktop), tablet computer, SmartPhone, etc., with user input and output devices, such as a display, keyboard, mouse, and so on. Alternatively, the functionality and a display associated with user device 420 may be provided local to or integrated with management system 108.
With reference to
At 502, training files TF are provided to a training input of ML classifier 501 in its untrained state. The training files TF may include a variety of training labels. The training labels include (i) artificial and/or actual URIs of genuine products and/or assets online along with associated indicators/tags that identify the URIs as genuine, and (ii) artificial and/or actual URIs of counterfeit products and/or fake assets online along with associated indicators/tags that identify the URIs as counterfeit. ML classifier 501 trains on the training files TF to recognize the URIs as either genuine or counterfeit.
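The a priori training at 502 may be illustrated with a toy gradient-descent pass over labeled feature vectors. The feature meanings, learning rate, and sample data are illustrative assumptions only:

```python
import math

def train_logistic(samples, labels, lr=0.5, epochs=300):
    # Toy stochastic-gradient training of a logistic model on labeled
    # URI feature vectors (label 1 = counterfeit, 0 = genuine).
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

# Hypothetical training set: [discount ratio, spam-email hits].
samples = [[0.9, 1.0], [0.8, 0.9], [0.1, 0.0], [0.2, 0.1]]
labels = [1, 1, 0, 0]
weights, bias = train_logistic(samples, labels)

def predict(x):
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))
```

The run-time feedback of whitelist URIs described at 506 would simply extend the labeled set with additional label-0 examples.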
At 506, (corresponding to classifying operation 216 described above) real-time URIs from the repeating loops of method 200 (and 300) are provided to ML classifier 501 that was trained at 502. Trained ML classifier 501 makes classification decisions to classify each URI as one of a blacklist URI, a whitelist URI, or an undetermined URI. Each whitelist URIs is fed-back to a supervised training input of ML classifier 501 to train the ML classifier during run-time.
In summary, presented herein are techniques directed to automatically finding, classifying, and taking corrective actions against counterfeit products and/or fake assets online. The techniques combine online data collection using a focused search of the attack vectors used to distribute information regarding, and to advertise, counterfeit products, network threat (intelligence) data, and machine learning algorithms to perform detection, and take action against the advertisement and sale of counterfeit products and other fake assets online, which may be discovered on websites, marketplaces, the dark web, social sites, phishing and spam email lists, and so on.
In summary, in one form, a method is provided comprising: at a management system configured to communicate with one or more networks: performing a focused search to find uniform resource identifiers (URIs) of potentially counterfeit products and/or fake assets online based on brand assets terms associated with an enterprise; performing a search for adjacencies of blacklist URIs of counterfeit products and/or fake assets online to find additional URIs of potentially counterfeit products and/or fake assets online that are related to the blacklist URIs; adding the URIs and the additional URIs to a URI list; classifying, by a machine learning classifier, each URI on the URI list as one of a blacklist URI of counterfeit products and/or fake assets online, and a whitelist URI of authentic products and/or assets online that are not counterfeit or fake; repeating (i) the performing the search for adjacencies using blacklist URIs resulting from the classifying, (ii) the adding, and (iii) the classifying, to cause the URI list, a number of blacklist URIs, and a number of whitelist URIs to expand from the repeating; and removing access to counterfeit and/or fake assets online revealed by the focused search, the search for adjacencies, or the classifying.
In another form, an apparatus is provided comprising: a network interface unit to communicate with a network; and a processor coupled to the network interface unit and configured to perform: performing a focused search to find uniform resource identifiers (URIs) of potentially counterfeit products and/or fake assets online based on brand assets terms associated with an enterprise; performing a search for adjacencies of blacklist URIs of counterfeit products and/or fake assets online to find additional URIs of potentially counterfeit products and/or fake assets online that are related to the blacklist URIs; adding the URIs and the additional URIs to a URI list; classifying, by a machine learning classifier, each URI on the URI list as one of a blacklist URI of counterfeit products and/or fake assets online, and a whitelist URI of authentic products and/or assets online that are not counterfeit or fake; repeating (i) the performing the search for adjacencies using blacklist URIs resulting from the classifying, (ii) the adding, and (iii) the classifying, to cause the URI list, a number of blacklist URIs, and a number of whitelist URIs to expand from the repeating; and removing access to counterfeit and/or fake assets online revealed by the focused search, the search for adjacencies, or the classifying.
In a further form, a non-transitory computer readable storage medium is provided. The computer readable medium is encoded with instructions, that when executed by a processor, are operable to perform performing a focused search to find uniform resource identifiers (URIs) of potentially counterfeit products and/or fake assets online based on brand assets terms associated with an enterprise; performing a search for adjacencies of blacklist URIs of counterfeit products and/or fake assets online to find additional URIs of potentially counterfeit products and/or fake assets online that are related to the blacklist URIs; adding the URIs and the additional URIs to a URI list; classifying, by a machine learning classifier, each URI on the URI list as one of a blacklist URI of counterfeit products and/or fake assets online, and a whitelist URI of authentic products and/or assets online that are not counterfeit or fake; repeating (i) the performing the search for adjacencies using blacklist URIs resulting from the classifying, (ii) the adding, and (iii) the classifying, to cause the URI list, a number of blacklist URIs, and a number of whitelist URIs to expand from the repeating; and removing access to counterfeit and/or fake assets online revealed by the focused search, the search for adjacencies, or the classifying.
Although the techniques are illustrated and described herein as embodied in one or more specific examples, they are nevertheless not intended to be limited to the details shown, since various modifications and structural changes may be made within the scope and range of equivalents of the claims.