The subject matter of the current invention is the extraction of unique identifiers, such as mobile device advertisement identifiers, mobile application identifiers, publisher identifiers used by CDN or cloud providers, session identifiers etc., and the association of these identifiers with a subscriber identifier (operator IMSI or IMEI), device type and application. The identifiers are extracted from data collected and correlated from multiple sources within the operator network, such as user plane network traffic, control plane network traffic, flow logs from operator network devices (web server, transit web-cache/proxy, GGSN/PGW, packet probe/DPI devices, RADIUS server) and subscription/service plan data.
Further, the current invention uses self-learning and auto-tuning to learn, validate, discard and update the identifiers associated with semi-permanent entities (subscriber, site, application etc.), thus maintaining the accuracy of these identifiers on a continuous basis. Additionally, the invention computes a confidence level for each identifier associated with a semi-permanent entity. The confidence level allows a system receiving these identifiers and their associations to use its own methods (outside the scope of the current invention) to pick and apply the identifier best suited to its application.
The Identifier association with semi-permanent entities (subscriber, web-site, application etc.) is made available to consuming or receiving systems for applications including but not limited to monetization of data by advertisements, sponsored data, service plan promotions, monitoring/usage reports, content selection and delivery, QOE optimizations via APIs and/or pre-defined formatted files.
Identifiers such as the subscriber ID (IMSI, MSISDN), which are useful in categorizing user demographics, browsing patterns etc., are valuable to advertisers selling targeted advertisements. Many service providers on the internet that offer free services, such as mail services, YouTube etc., earn revenue by selling advertisements shown when a subscriber uses their service. However, making subscriber identifiers visible on the internet violates the user's privacy, since a permanent subscriber identifier such as the IMSI or phone number is linked to significant subscriber information throughout the internet and across many businesses. Thus, mobile device vendors such as Apple, Google etc., assign their own relatively dynamic identifiers that correspond to a subscriber for longer periods; such IDs are resettable by the subscriber and/or the device vendor. Apple calls them “IDFA”, whereas Google calls them ADIDs on its devices. Both are termed “ADIDs” in the current invention. The scope of such an identifier is the device vendor and specific device/OS/application releases. Apple's IDFAs are therefore independent of Google's ADIDs, and because these identifiers are limited in scope they overcome the issues with privacy protection. Additionally, an application vendor such as Google that sells applications for both iPhones and Android phones could use ADIDs when its applications are active; thus a device such as an iPhone may have both an IDFA and an ADID. Learning the ADIDs associated with a more permanent subscriber identifier such as the IMSI from the traffic exchanged through the mobile network, by developing insights and generating a learning algorithm, is one of the key subject matters of the current invention.
Similar to ADIDs, app store vendors such as “I-Tunes” and Google “AppStore” use identifiers that are unique within their app store to identify an application; app vendors communicate this identifier while communicating through the internet. Identifying the app-id in the device communication provides the application context for traffic to/from the device in a given period of time. While not every packet to or from the device carries the specific app-id in the clear (without encryption), identifying uplink/downlink packets in a given period and associating them with an app-id facilitates characterizing the specific behavior, detecting anomalies and behavior changes when newer versions are released, and observing and predicting usage patterns, all of which benefit mobile operators and device & application vendors. Additionally, devices typically contain a generic application such as a browser that facilitates searching and/or reaching web-sites without requiring the download of native applications. Also, many websites may not have a unique client application and are reachable via HTTP, HTTPS or other protocols using browsers such as Safari, Firefox, Internet Explorer, Chrome etc., with W3C semantics. Identifying and separating browser traffic (along with the specific browser) from non-browser (native app) traffic, by learning insights from browser access patterns and the information contained in the packet exchanges, is another embodiment of the current invention.
Identifying other unique identifiers for specific uses, such as a cloud-id or CDN-ID, that are assigned by a specific service provider, and associating them with the corresponding clients, are additional embodiments of the current invention.
Identifiers on the internet come in a variety of compositions and lengths. The most commonly used identifiers on the internet are UUIDs as defined in RFC-4122, which are 32 hex characters long and take one of the following forms:
Mobile advertisers use UUIDs for tracking and delivering targeted mobile advertisements to mobile devices, both phones and tablets. Such RFC-4122 compliant UUIDs are sometimes referred to by different names depending on their usage, for example, as IDFA on Apple devices and ADID on Android devices. Collectively, IDFA and ADID are referred to as “Ad-Id” in this document. Such Ad-Ids are used by one or more applications while requesting mobile advertisements, so that the Publisher's application server can use the Ad-Id either to request server-side ads and embed the mobile Ad content within the content it serves, or to forward it to a third-party Ad-Manager to serve targeted Ads.
Similar to Ad-Ids, App stores like iTunes and Google Play use java package names or Appstore specific identifiers to identify and track individual mobile applications and their versions downloaded by subscribers. Such Appstore identifiers are referred to as “App-Id” in this document.
Similarly, CDN (Content Delivery Network) and Cloud providers use their own scheme of identifiers to identify the Content Publishers whose content is cached or prefetched or hosted and delivered from their networks. Such ids are referred to as “Cloud-Id” in this document. These identifiers may or may not be globally unique and may not use the RFC4122 format, since they need to be unique only in their own domains.
It is important to note that while the “unique identifiers in the current invention” refers to the Universally Unique identifiers per RFC4122, the invention is equally applicable to unique ids used by a website or app-store, cloud environment etc., with a form defined by that provider to identify distinct clients, apps, sites etc.
The current invention extracts Ad-Ids, App-Ids, Cloud-Ids etc., from correlated multi-dimensional data and, where applicable, classifies them based on the behavioral category of the servers receiving or transmitting them on the internet. It associates these Ids with individual subscribers (or apps or sites) in near real time and automatically re-associates them with the corresponding “Key Entities” (KE) if and when the device user or server updates the Ids.
All user flows containing Ad-Ids, App-Ids, Cloud-Ids etc., are communicated between the subscriber's mobile device and the Publisher server (Appstore/Advertiser) either via HTTP or via encrypted protocols including but not limited to HTTPS/TLS. When these Ids are communicated using encrypted protocols such as HTTPS/TLS, the App-Ids or Ad-Ids are not directly visible to a transit network device or a packet capture/DPI device and cannot be observed or extracted. Extraction and identification of Ad-Ids, App-Ids and Cloud-Ids from encrypted user flows is outside the scope of the current invention. However, an ID exchanged by the user device within an encrypted protocol may also appear, unencrypted and with other keywords or tags, in other exchanges to/from the user device in protocols such as HTTP; identifying these and associating them with Key Entities (KEs) is one of the subject matters of the current invention.
This section provides a detailed description of the invention and underlying mechanisms for extracting and identifying the Identifiers from multiple streams of data.
UUID Extraction: When determining whether a particular UUID matches the RFC 4122 pattern, the following aspects need to be considered:
In general, a device may communicate with certain domains/sites/web-pages on the internet with or without encrypted protocols in varying proportions. Since the current invention does not attempt to decrypt traffic to extract or identify an Ad-Id, the system takes a varying amount of time to extract, verify and assign an Ad-Id to a subscriber. As each subscriber's Ad-Ids are learned by the methods of the current invention, the number of subscribers with unknown Ad-Ids decreases with time. While the system does not decrypt traffic from encrypted protocols such as HTTPS, it uses the un-encrypted content of such protocols (for example, the initial exchanges while establishing secure tunnels), or contextual or temporal association between encrypted and un-encrypted protocols (e.g., DNS prior to an HTTPS connection, HTTP content from the same user IP address during encrypted content exchanges).
The steps involved in identifying Ad-Ids, App-Ids and Cloud-Ids are outlined below. Steps 1-3 are common to extraction of Ids from multi-dimensional data, whereas the remaining steps are specific to a class of IDs:
6.1. Ad-ID to Subscriber ID Mapping
Subscriber ID to Ad-Id mapping is performed in 3 steps:
6.2. Ad-Id Algorithm Version 1
Ad IDs seen in URLs generally look like
http://host.com?key=junk...&keyword=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx...
Where “x” is a hex character (upper or lower case)
May be the first parameter (after “?”) or a subsequent one (after “&”)
May use “%3D” instead of “=”
There may be more than one parameter of this form per URL, so all of them have to be checked
Ad ID form is generally (always?) RFC-4122 compliant
version 1 (time/node based): xxxxxxxx-xxxx-1xxx-Rxxx-xxxxxxxxxxxx
version 4 (random): xxxxxxxx-xxxx-4xxx-Rxxx-xxxxxxxxxxxx
“R” is 8, 9, A, or B, since the top two bits being “10” indicate RFC-4122 compliance
We have only observed version 1 and 4, RFC-4122 Ad IDs, so far.
Type-1 includes a 6 byte MAC address and 60-bit timestamp. The type-1 timestamps we have observed are generally distributed within the previous year (indicating that Ad ID lifetime is probably <1 year).
We haven't analyzed MAC addresses, or whether ID version depends on OS or device type.
Type 4 includes a 122-bit random value and nothing else.
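The URL pattern described above can be illustrated with a minimal Python sketch. It assumes that only version 1 and version 4 UUIDs with the RFC-4122 variant are of interest, as observed above; the function name extract_uuid_params and the sample URL are hypothetical and are not part of the described system.

```python
import re
from urllib.parse import urlsplit, unquote

# 8-4-4-4-12 hex digits; version nibble 1 or 4; variant nibble 8, 9, A or B.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[14][0-9a-fA-F]{3}-"
    r"[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}"
)

def extract_uuid_params(url):
    """Return (parameter name, uuid) for every UUID-shaped query value."""
    found = []
    # Check every parameter, whether it follows "?" or "&".
    for part in urlsplit(url).query.split("&"):
        # Decode percent-escapes so "%3D" is treated like "=".
        name, sep, value = unquote(part).partition("=")
        if sep and UUID_RE.fullmatch(value):
            found.append((name, value.lower()))
    return found

# Hypothetical URL with one plain and one percent-encoded parameter.
sample = ("http://host.com?key=junk"
          "&idfa=510F3788-637F-4469-8858-44968C5D4642"
          "&x%3Dd7267c6f-6f35-4b51-9eaf-41333100ef66")
candidates = extract_uuid_params(sample)
```

Values are lower-cased so that the same identifier reported in different cases compares equal in later aggregation.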
Using Hadoop, search all available HTTP records for every query parameter that has the right form (UUID),
save the subscriber, domain, keyword name and potential Ad ID, and aggregate all (subscriber|adid) pairs seen with each (domain|parameter) pair:
imrworldwide.com|ts 9781234567|d7267c6f-6f35-4b51-9eaf-41333100ef66, 7815551212|d7267c6f-6f35-4b51-9eaf-41333100ef66
radiotime.com|idfa 9784443232|510f3788-637f-4469-8858-44968c5d4642
artofclick.com|google_aid 5085235532|50556478-3db9-405a-a267-28b07202b2ee, 7815551212|d7267c6f-6f35-4b51-9eaf-41333100ef66
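As an in-memory stand-in for the Hadoop aggregation, the following sketch groups (subscriber, Ad ID) pairs under the (domain, parameter) pair that carried them. The record layout (subscriber, domain, parameter, candidate ID) and the function name aggregate_pairs are assumptions made for illustration only.

```python
from collections import defaultdict

def aggregate_pairs(records):
    """Collect the set of (subscriber, adid) pairs seen per (domain, parameter)."""
    groups = defaultdict(set)
    for subscriber, domain, parameter, candidate in records:
        groups[(domain, parameter)].add((subscriber, candidate.lower()))
    return dict(groups)

# Records corresponding to the sample output lines above.
records = [
    ("9781234567", "imrworldwide.com", "ts", "d7267c6f-6f35-4b51-9eaf-41333100ef66"),
    ("7815551212", "imrworldwide.com", "ts", "d7267c6f-6f35-4b51-9eaf-41333100ef66"),
    ("9784443232", "radiotime.com", "idfa", "510f3788-637f-4469-8858-44968c5d4642"),
    ("5085235532", "artofclick.com", "google_aid", "50556478-3db9-405a-a267-28b07202b2ee"),
    ("7815551212", "artofclick.com", "google_aid", "d7267c6f-6f35-4b51-9eaf-41333100ef66"),
]
aggregated = aggregate_pairs(records)
```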
For every domain/query parameter name from step 1, create a set of all the subscriber/Ad ID pairs, discarding any problematic ones (e.g. ones that use “;” to delimit additional sub-parameters within the query parameter), then determine if there is approximately one unique Ad ID for each unique subscriber. If not, discard the data for that domain/query parameter (see one-to-one analysis later).
Take pairwise combinations of the records from step 2. Each record is identified by a (domain/qp) and contains a list of (subscriber|adid) pairs. For each pair of records, combine their subscriber|adid pairs and test if the combined data is still sufficiently one-to-one, i.e. they don't disagree about which subscribers go with which adids. Only combine records where both patterns see the same subscriber, or both patterns see the same Ad ID. If they match each other, put them in the same group. Keep adding to the various groups as new pairs are analyzed, assuming that if A~B and A~C, then A~B~C.
Once the groups are determined, combine subscriber/adid pairs for the entire group and check one-to-one. Print out each group and its one-to-one parameters. Typically, one group should stand out by containing a large number of query parameters, and having good statistics. This is the desired group of patterns. As a check, we also combine a few of the final groups to see if the results could be improved. Typically, this will only help if the number of records analyzed in step 1 was too small.
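The pairwise grouping can be sketched as follows, using a union-find structure so that A~B and A~C place A, B and C in one group. The one-to-one test here is the simple (mappings/subscribers)-1 and (mappings/adids)-1 measure; the threshold max_error and all function names are assumptions, and each record is assumed to hold at least one pair.

```python
from itertools import combinations

def one_to_one_errors(pairs):
    """Return ((mappings/subscribers) - 1, (mappings/adids) - 1)."""
    pairs = set(pairs)
    subscribers = {s for s, _ in pairs}
    adids = {a for _, a in pairs}
    return len(pairs) / len(subscribers) - 1, len(pairs) / len(adids) - 1

def group_patterns(records, max_error=0.1):
    """records maps (domain, qp) to a non-empty set of (subscriber, adid) pairs."""
    parent = {key: key for key in records}

    def find(key):
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    for a, b in combinations(records, 2):
        pa, pb = records[a], records[b]
        same_sub = {s for s, _ in pa} & {s for s, _ in pb}
        same_id = {i for _, i in pa} & {i for _, i in pb}
        if not (same_sub or same_id):
            continue  # nothing in common, so nothing to compare
        if max(one_to_one_errors(pa | pb)) <= max_error:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for key in records:
        groups.setdefault(find(key), set()).add(key)
    return list(groups.values())

# Hypothetical patterns: the first two agree, the third contradicts them.
patterns = {
    ("a.example", "id"): {("s1", "x"), ("s2", "y")},
    ("b.example", "uid"): {("s1", "x"), ("s3", "z")},
    ("c.example", "sid"): {("s1", "q")},
}
pattern_groups = group_patterns(patterns)
```

In this example the third pattern reports a different ID for subscriber s1, so combining it with either of the others breaks the one-to-one property and it stays in its own group.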
Ideal: one adid per subscriber. 3 subscribers, 3 adids, 3 mappings.
Error formulas:
The one-to-one function returns four error parameters. The first two are (mappings/subscribers)−1 and (mappings/adids)−1, as described in the one-to-one discussion. These are no longer used due to the fact that they would indicate that 100 mappings with 100 adids (0% error), was just as good as 1 mapping with 1 adid (0% error).
The second two parameters are the same adid and subscriber error parameters with an effort to apply bayesian statistics, which integrates the chance of seeing the observed result over all the possible probabilities. Effectively, the smaller the sample size, the more the error is adjusted. So, 100 mappings with 90 adids would be considered 90 successes with 10 failures to have a unique adid.
Strict division would say the success rate was 90%, but the bayesian success probability is 89.2% (pretty close, since the sample size is fairly high). But for a smaller sample, say 9 successes and no failures, rather than 100% success, we get about 90.9%, indicating that 9 out of 9 scores about the same as 90 out of 100. There are other ways to de-emphasize smaller samples (most simply, by discarding them). But this seems to work well, especially when the dataset is small and we can't afford to discard data.
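The figures quoted above are reproduced by the posterior mean of the success probability under a uniform prior, (successes + 1) / (trials + 2). Whether this is the exact formula used is an assumption; the sketch below only shows that it matches the 89.2% and 90.9% values. The function names are illustrative.

```python
def frequentist_success(successes, failures):
    """Plain ratio of successes to trials."""
    return successes / (successes + failures)

def bayesian_success(successes, failures):
    """Posterior mean under a uniform prior: small samples are pulled toward 50%."""
    return (successes + 1) / (successes + failures + 2)

large_sample = bayesian_success(90, 10)   # 91 / 102
small_sample = bayesian_success(9, 0)     # 10 / 11
```

The corresponding error estimate is one minus the success probability, so a perfect but tiny sample no longer scores better than a large, nearly perfect one.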
The format of the output is the set of (‘domain’, ‘query parameter’) pairs considered part of the same group, followed by the four error estimates of the combined group: frequentist adid error, frequentist subscriber error, bayesian adid error, bayesian subscriber error. We use the bayesian, so for the first group, adid and subscriber error are 9% and 0.5% respectively. That indicates this is probably a meaningful ID within that group of domains, but does not match the real group (which is not shown). The other two groups have errors of 2.7% and 0.4%, and 9.3% and 1.9%. Probably locally meaningful, but not global IDs. The “real” group contained 143 patterns and had errors of 6.2% and 0.4% across all patterns.
Observations from Interface Data (PCAP) with Sample Tests on Previous Algorithm:
1) collect for each domain/query parameter a set of (subscriber, uuid) pairs from clickstream data
2) filter out domain/query parameter tuples which don't comply with the following constraints
1) remove blacklisted uuids and (domain, query parameter) tuples
2) for each uuid, count the number of associated ‘query parameters’ that match well known tags for advertising id.
3) if there were no uuids associated with at least MIN_QP well known tags then declare the election invalid and move to next subscriber
4) the uuid with the most votes is now declared the winner
5) each (domain, query parameter) that voted for the winning uuid is given a win
6) each (domain, query parameter) that voted for a losing uuid is given a loss
7) discard any (domain, query parameter) tuples that had an election loss % > MAX_PCT_ELECTION_LOSS. The remaining set of (domain, query parameter) tuples are declared to be credible sources for advertising id.
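The election steps above can be sketched in Python as follows. This is one reading of the steps, under the assumption that a uuid's votes are the number of distinct well-known-tag sources reporting it for the subscriber; the tag list and the threshold values are placeholders, not values from the described system.

```python
from collections import Counter, defaultdict

WELL_KNOWN_TAGS = {"idfa", "adid", "google_aid"}  # illustrative tag names
MIN_QP = 1                                        # assumed threshold
MAX_PCT_ELECTION_LOSS = 20.0                      # assumed threshold

def run_elections(observations, uuid_blacklist=(), source_blacklist=()):
    """observations: iterable of (subscriber, domain, query parameter, uuid)."""
    ballots = defaultdict(set)
    for sub, domain, qp, uuid in observations:
        if uuid in uuid_blacklist or (domain, qp) in source_blacklist:
            continue  # step 1: drop blacklisted uuids and sources
        ballots[sub].add(((domain, qp), uuid))

    winners, wins, losses = {}, Counter(), Counter()
    for sub, cast in ballots.items():
        # step 2: count well-known-tag sources per uuid
        votes = Counter(u for (_, qp), u in cast if qp.lower() in WELL_KNOWN_TAGS)
        if not votes or max(votes.values()) < MIN_QP:
            continue  # step 3: election invalid for this subscriber
        winner = max(sorted(votes), key=votes.get)  # step 4, deterministic ties
        winners[sub] = winner
        for source, uuid in cast:  # steps 5 and 6
            if uuid == winner:
                wins[source] += 1
            else:
                losses[source] += 1

    credible = set()
    for source in set(wins) | set(losses):  # step 7
        loss_pct = 100.0 * losses[source] / (wins[source] + losses[source])
        if loss_pct <= MAX_PCT_ELECTION_LOSS:
            credible.add(source)
    return winners, credible

# Hypothetical observations for three subscribers.
obs = [
    ("s1", "ads.example", "idfa", "U1"), ("s1", "trk.example", "adid", "U1"),
    ("s1", "foo.example", "sid", "U1"), ("s1", "bar.example", "token", "X1"),
    ("s2", "ads.example", "idfa", "U2"), ("s2", "foo.example", "sid", "U2"),
    ("s2", "bar.example", "token", "X2"),
    ("s3", "baz.example", "token", "Z"),
]
winners, credible = run_elections(obs)
```

Note that a source without a well-known tag (foo.example/sid here) can still become credible, because it consistently reports the winning uuid.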
The above process can be done offline periodically, or continuously with a stream of clickstream records.
The running system will maintain the most likely advertising id for each subscriber. When a new uuid is observed for a subscriber from a set of credible tuples (domain, query parameter), the new uuid is promoted to be the advertising id.
When a new, non-blacklisted, (domain, query parameter) tuple is observed with at least MIN_VOTES subscribers and its election loss percentage is <MAX_PCT_ELECTION_LOSS, the new (domain, query parameter) tuple is promoted to credible status.
When an existing credible (domain, query parameter) tuple's loss percentage exceeds MAX_PCT_ELECTION_LOSS for a period, it is demoted from credible status. If its loss percentage stays above MAX_PCT_ELECTION_LOSS for a period of time, the tuple is put on the blacklist.
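The promotion, demotion and blacklisting rules can be viewed as a small state machine evaluated once per period. In the sketch below the status names, the threshold values and the DEMOTE_AFTER and BLACKLIST_AFTER period counts are all assumptions, since the text does not fix the length of a period.

```python
MIN_VOTES = 50                # assumed minimum number of subscribers
MAX_PCT_ELECTION_LOSS = 20.0  # assumed loss threshold, in percent
DEMOTE_AFTER = 1              # assumed consecutive bad periods before demotion
BLACKLIST_AFTER = 3           # assumed consecutive bad periods before blacklisting

def next_status(status, votes, loss_pct, bad_periods):
    """Return (new status, consecutive bad periods) for one (domain, qp) tuple."""
    over = loss_pct > MAX_PCT_ELECTION_LOSS
    bad_periods = bad_periods + 1 if over else 0
    if status == "credible" and over and bad_periods >= DEMOTE_AFTER:
        status = "demoted"
    elif status == "demoted" and over and bad_periods >= BLACKLIST_AFTER:
        status = "blacklisted"
    elif (status in ("candidate", "demoted") and votes >= MIN_VOTES
          and loss_pct < MAX_PCT_ELECTION_LOSS):
        status = "credible"
    return status, bad_periods

# A credible tuple that keeps losing elections for three periods in a row.
state = ("credible", 0)
history = []
for _ in range(3):
    state = next_status(state[0], 60, 30.0, state[1])
    history.append(state[0])
```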
When a mobile device communicates with servers on the internet (cloud, origin server or CDN), the application on the device may be a browser (Safari, Firefox, Internet Explorer, Chrome etc.), or a native application that is downloaded and running on the device. Both kinds of application may use the HTTP or HTTPS protocol and may not be distinguishable based on TCP/IP port numbers alone. Also, several browsers integrate a search engine. Thus, when a user enters a string into the browser tool bar, the string is sent to the default search engine that the browser uses, which returns search results; the user then selects some sites/links within the search results. This generates an access pattern in the user flow data consisting of a TCP (HTTP or HTTPS) connection with small uplink traffic, followed by a downloaded page, followed by a sequence of DNS requests and TCP connections to other domains. Such a dataflow pattern identifies Search+Browser based user accesses. The following steps differentiate between Browser and Non-Browser (Native Application) based accesses from a user device:
Some of the unique identifiers extracted from user flows may correspond to application unique identifiers (AppID) that are unique to the specific device type or appstore, for example, i-phone/AppStore may use one format of IDs, and Android a different format. For example, AppIDs by Apple use the format:
A1B2C3D4E5.com.domainname.appname, where the string “A1B2C3D4E5” is Apple-assigned, and “com.domainname.appname” is developer-assigned, and the two together are termed the “AppId”.
After browser accesses are filtered from HTTP/URL flow records, for each device type, the domain names, UIDs & associated tags are maintained similar to Ad-Ids in section 6.1. For each UID, a confidence level is maintained that indicates the probability that the UID is an AppID. When a UID is associated with tag-name=“appid” in the URL string, the confidence level is set to 100%. For each subscriber-id, flows are grouped into sessions based on multi-second idle times. Thus, a user's session may have HTTP, HTTPS, DNS etc. flow records, and UIDs & tags will be visible in the HTTP URL records.
Thus the AppId is the ID for all the flows in that session. When a user activates an app on the mobile device, the majority of its communication, by volume and/or time duration, will be with the app's webserver. Thus, for each user session, the dominant domain names are tracked. If a UID appears in the sessions of multiple users, and the dominant domain names (FQDNs) in those sessions are the same, that UID is likely to be an Application ID, and the associated confidence level is increased. UIDs with confidence levels greater than 60% are marked as Application IDs. The data collection & analytics system characterizes application behavior from observed sessions with the same Application ID.
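The session grouping and dominant-domain tracking can be sketched as below. The flow layout (timestamp, FQDN, bytes, UID or None), the 30-second idle gap and the function names are assumptions; the amount by which confidence is increased is not specified above, so the sketch only gathers the evidence (users and dominant domains per UID) on which such an update would be based.

```python
from collections import Counter

IDLE_GAP_SECONDS = 30  # assumed value for the "multi-second idle time"

def split_sessions(flows, idle_gap=IDLE_GAP_SECONDS):
    """Split one subscriber's flows into sessions separated by idle gaps."""
    sessions, current, last_ts = [], [], None
    for flow in sorted(flows, key=lambda f: f[0]):
        if last_ts is not None and flow[0] - last_ts > idle_gap:
            sessions.append(current)
            current = []
        current.append(flow)
        last_ts = flow[0]
    if current:
        sessions.append(current)
    return sessions

def dominant_domain(session):
    """Return the FQDN carrying the most bytes in the session."""
    volume = Counter()
    for _, fqdn, nbytes, _ in session:
        volume[fqdn] += nbytes
    return max(sorted(volume), key=volume.get)

def uid_evidence(sessions_by_user):
    """Map each UID to (users that sent it, dominant domains of those sessions)."""
    evidence = {}
    for user, sessions in sessions_by_user.items():
        for session in sessions:
            domain = dominant_domain(session)
            for _, _, _, uid in session:
                if uid is not None:
                    users, domains = evidence.setdefault(uid, (set(), set()))
                    users.add(user)
                    domains.add(domain)
    return evidence

# Hypothetical flows for two subscribers using the same app.
u1 = [(0, "app.example", 5000, "UIDX"), (2, "ads.example", 200, None),
      (100, "other.example", 900, None)]
u2 = [(10, "app.example", 3000, "UIDX")]
evidence = uid_evidence({"u1": split_sessions(u1), "u2": split_sessions(u2)})
```

A UID seen for several users with a single shared dominant domain, as UIDX is here, is the case in which the confidence level would be raised.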
CDNs use a variety of techniques to steer traffic away from the original website (brand/publisher) onto the content delivery network. These techniques include URL rewrite, HTTP redirection, DNS redirection, and anycast. The method outlined uses a stream of HTTP(S)/URL flow records, a URL classification function, and a list of known CDN URL patterns. It is assumed that the source of the http records will record domain observed from DNS monitoring for https traffic.
The HTTP records are sorted in ascending time order and inspected on a per subscriber level. Each HTTP record is classified according to its URL into a category and subcategory. Categories include ‘Advertising’, ‘Analytics’, ‘CDN’, ‘Software APIs/Service’, etc. Once classified, the record is dropped if it is determined not to be associated with a publisher/brand (Origin Server); for example, records classified as ‘Advertising’, ‘Analytics’ or ‘Software APIs/Service’ are dropped. If the record is not associated with a known CDN, then the associated brand is captured as the ‘current’ brand for this user. If the record is associated with a known CDN pattern and there is not yet an underlying brand associated with this CDN, then the current CDN pattern is associated with the ‘current’ brand and a ‘vote’ for this CDN/brand association is emitted and forwarded using the CDN pattern field as key. If the record is associated with a known CDN pattern as well as a known publisher, the record is dropped.
Once all of the ‘votes’ have been cast for a particular CDN pattern, the next stage of the learning process counts the votes and sorts them in descending order. If there is a clear winner according to the vote count (e.g. 95% of votes), number of unique candidates (e.g. less than X), overall number of votes cast (e.g. greater than Y), and bytes/hits observed for the current CDN pattern, then the winner is declared to be the associated brand/publisher for this CDN pattern and the categorization database is updated.
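A minimal sketch of the vote emission stage follows. The classify function stands in for the URL classification function and is hypothetical: it is assumed to return a (category, name) pair, where name is the CDN pattern for ‘CDN’ records and the brand for publisher records. The record layout and all other names are likewise assumptions.

```python
NON_PUBLISHER = {"Advertising", "Analytics", "Software APIs/Service"}

def emit_cdn_votes(records, classify, known_brand_for_cdn):
    """records: iterable of (subscriber, timestamp, url).

    Returns a list of (cdn_pattern, brand) votes.
    """
    votes = []
    current_brand = {}  # subscriber -> most recently seen brand
    # Inspect each subscriber's records in ascending time order.
    for subscriber, _, url in sorted(records, key=lambda r: (r[0], r[1])):
        category, name = classify(url)
        if category in NON_PUBLISHER:
            continue  # noise: not associated with a publisher/brand
        if category != "CDN":
            current_brand[subscriber] = name
            continue
        if name in known_brand_for_cdn:
            continue  # CDN pattern already tied to a publisher
        brand = current_brand.get(subscriber)
        if brand is not None:
            votes.append((name, brand))
    return votes

# Hypothetical classification table and records.
TABLE = {
    "news.example/home": ("News", "news.example"),
    "ads.example/x": ("Advertising", "ads.example"),
    "cdn1.example/img": ("CDN", "cdn1.example"),
    "shop.example/": ("Shopping", "shop.example"),
    "cdn2.example/a": ("CDN", "cdn2.example"),
}
recs = [
    ("s1", 3, "cdn1.example/img"), ("s1", 1, "news.example/home"),
    ("s1", 2, "ads.example/x"), ("s2", 1, "cdn1.example/img"),
    ("s2", 2, "shop.example/"), ("s2", 3, "cdn2.example/a"),
]
cdn_votes = emit_cdn_votes(recs, TABLE.get, {"cdn2.example": "shop.example"})
```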
During the election process, if a CDN pattern is found to be associated with an excessive number of brand candidates, each containing a significant vote count, then the URL will be reclassified with a category that is not associated with a publisher/brand and the categorization database will be updated.
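The vote counting and reclassification rules can be sketched as follows. The 95% share comes from the example above; the values standing in for X and Y, the definitions of a "significant" vote count and an "excessive" number of candidates, and the function name are assumptions. The bytes/hits criterion is omitted for brevity.

```python
from collections import Counter

def elect_brand(votes, min_share=0.95, max_candidates=5, min_votes=100,
                significant_share=0.05, excessive_candidates=5):
    """votes: list of brand names voted for one CDN pattern.

    Returns ("winner", brand), ("reclassify", None) or ("undecided", None).
    """
    tally = Counter(votes)
    total = sum(tally.values())
    if total == 0 or total < min_votes:
        return ("undecided", None)  # too few votes cast
    brand, top = tally.most_common(1)[0]
    if top / total >= min_share and len(tally) < max_candidates:
        return ("winner", brand)
    # Many candidates, each with a significant share: a shared CDN URL.
    significant = [b for b, n in tally.items() if n / total >= significant_share]
    if len(significant) >= excessive_candidates:
        return ("reclassify", None)
    return ("undecided", None)

clear = elect_brand(["news.example"] * 97 + ["shop.example"] * 3)
shared = elect_brand([f"brand{i}.example" for i in range(10) for _ in range(20)])
sparse = elect_brand(["news.example"] * 5)
```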
Once the CDN association process completes and the current categorization database is updated, the process can be repeated with the same or a different set of data one or more times to increase the accuracy of the learning result. A ‘time of learning’ is associated with each learned relationship and can be used to trigger re-verification of previously learned relationships or to remove mappings that have not been observed for a configurable period. The learning process is intended to be run periodically to update the learned relationships.
The intention of the process is to automatically learn the relationship between a CDN provider URL and the underlying content/brand (publisher). The process outlined removes the noise (ads, analytics, software API/services, etc.) from the input stream to make the signal (brand/CDN association in time) stronger. This technique employs the effect of the law of large numbers by observing traffic patterns from a very large number of subscribers over space and time to filter the incoming signal.
This section describes specific use cases for each of the Ids extracted.
The AdID or IDFA uniquely identifies a mobile device for delivering mobile advertising. The mobile advertising ecosystem, from the mobile applications through to mobile ad delivery and analytics, uses the IDFA for ad delivery, tracking and performance tracking purposes. The AdId is transmitted from mobile devices to remote advertising servers as a parameter on HTTP, and in some cases HTTPS, advertising calls and can be extracted through mobile traffic elements.
Further, the network providers uniquely identify their own subscribers using a hashed version of their own SubId. The SubId or a hashed derivative of this SubId is used by the network providers to transmit/route traffic to/from the internet and to bill the subscriber for mobile usage. The SubId (or its derivative) remains static over the life of a mobile device. This enables identification and inference of the mobile behaviors & user demographics of the individual mobile devices connecting to the network. The mobile behaviors are extremely valuable for targeting the right mobile advertising to individual mobile devices.
By identifying and extracting AdIds from mobile advertising traffic in particular, correlating them to the SubId, and then associating them with the historical mobile behaviors & user demographics of a mobile device, network providers can leverage AdIds for monetization of mobile traffic flowing through their network elements in the mobile advertising ecosystem. Thus, the AdId to subscriber ID mapping:
This patent application claims priority to and benefit of the filing date of U.S. Provisional Patent Application No. 62/710,212, entitled “Correlating Multi-Dimensional Data to Extract & Associate Unique Identifiers for Analytics Insights, Monetization, QOE & Orchestration” filed Mar. 15, 2018, the entire disclosure of which is hereby incorporated herein by reference.