The invention relates to methods, systems, and computer program products for WAP browsing analysis in on-portal and off-portal domains.
With the advent of mobile technology and mobile media, some mobile operators are moving beyond the concept of a walled garden into the off-portal realm. For years, users of many mobile operators have been confined to consuming content provided by the operator in its content portal. As mobile handsets become more common and more innovative services can be offered, many new content providers and aggregators are moving into the value chain, offering customers content that is not necessarily associated with the mobile operator's portal. Further, the mobile industry is now moving into mobile advertising, where users are presented with advertisements while previewing mobile content or while performing some contextually related activity, such as searching for specific content. To keep up with the competition, mobile operators cannot rely only on themselves for supplying interesting content, and thus their business models need to be adjusted to incorporate the usage of off-portal content and service providers.
The following scenarios exemplify some emerging business models:
As the examples above show, the operator needs to be able to quantify the level of usage of a specific content provider's site in order to enable profitable business models. The challenge is that the operator's systems lack full information on users' consumption, and the operator has no way to validate information coming from the content provider. Specifically, in WAP communication, operators need to rely on their WAP gateway logs. Due to the WAP protocol structure, page components do not arrive in a structured way but rather as a stream of objects embedded in the main root objects. Such objects can be media or text objects or even embedded pages. Further, as the user can interact with the flow by entering a new URL or pressing an embedded link, the stream can change in the middle to start serving a new page. Thus, the challenge is how to reconstruct users' surfing effectively and accurately by identifying the pages they surf to off portal (namely, in sites that are not served by the operator and for which no knowledge of their site structure exists).
The current invention includes a system for analysis of hyperlink-based traffic (such as web or mobile web traffic) in off-portal domains using URL syntax analysis.
A method, a system, and a computer program product for WAP browsing analysis in on-portal and off-portal domains are disclosed.
The foregoing and other objects, features, and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, similar reference characters denote similar elements throughout the different views, in which:
Specification of the proposed algorithm follows.
It is assumed that the algorithm has access to the WAP Gateway log where requests for objects can be found (whether root URLs or embedded objects).
Several ideas lead to the method presented. First, the operator does not always need information at the granularity of a single page; billing by the page would be an impossible endeavor for a tier 1 operator. Thus, some level of aggregation is needed to allow the operator and the content provider to discuss usage at a granularity coarser than a single page but finer than a whole domain. Further, techniques to identify a logical page (namely, to determine that two pages that look different, due to personalization for example, are logically the same page, such as the entry page of an online merchant or one's bank account page) may incur high processing costs if their input is too large. Thus, limiting the analysis to a set of pages may be beneficial. To make this effective, some technique to do this at a granularity finer than the domain level is required.
In the first filtering step, the algorithm filters the log file information to include only requests for page objects. These can be either root pages or embedded pages. The critical point is that at this stage the input data is stripped of embedded objects that are not page objects, keeping only page-object MIME types such as text/html or text/wml.
In order to confront requests generated by robots, frames, and occurrences where users hit links before the page has fully loaded, page requests (page URLs with the right MIME type) are also filtered using the following heuristic: the log (containing only page URLs) is scanned, and a URL is removed if its request time lag is less than 1 second after the previous page request's response. This heuristic can be replaced with other rules as the environment evolves (for example, if new log entries based on new entities in the WAP habitat are developed).
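A minimal sketch of these two filtering steps follows, assuming a simplified log record of (timestamp, URL, MIME type) and using request timestamps only; actual WAP gateway log schemas vary by vendor:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# MIME types treated as page objects; extend for the actual deployment.
PAGE_MIME_TYPES = {"text/html", "text/wml", "application/xhtml+xml"}

@dataclass
class LogEntry:
    timestamp: datetime   # request time from the WAP gateway log
    url: str
    mime_type: str

def filter_page_requests(entries):
    """First filter: keep only requests whose MIME type marks a page object."""
    return [e for e in entries if e.mime_type.lower() in PAGE_MIME_TYPES]

def filter_fast_followers(page_entries, min_lag=timedelta(seconds=1)):
    """Second filter: drop a page request arriving less than min_lag after
    the previous page request (likely a robot, a frame, or a click on a
    link before the page finished loading)."""
    kept, prev_time = [], None
    for e in sorted(page_entries, key=lambda e: e.timestamp):
        if prev_time is None or e.timestamp - prev_time >= min_lag:
            kept.append(e)
        prev_time = e.timestamp  # lag is measured from the previous request
    return kept
```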
At this stage the list of domains the algorithm will analyze is generated in either of two ways:
This is the main phase of the algorithm.
Once the analysis objects have been defined (the page families), the algorithm re-scans the log file to collect information on each page family. For each URL that the algorithm identifies, it collects information and aggregates it as part of the page family the page belongs to. The algorithm scans the log between URLs and associates the information with the previous URL (namely, it is assumed that embedded objects belong to the URL that comes before them). By URL we refer to MIME page objects such as text/html, text/wml, etc., or any other MIME type that represents a page object.
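A minimal sketch of this aggregation phase follows; the page-family key used here (host plus the first two path tokens) and the per-record fields are illustrative assumptions, not the gateway's actual schema:

```python
from collections import defaultdict
from urllib.parse import urlsplit

PAGE_MIME_TYPES = {"text/html", "text/wml"}  # MIME types treated as pages

def page_family_of(url, depth=2):
    """Hypothetical page-family key: host plus the first `depth` path
    tokens, e.g. www.provider.com/sports/football for a depth of 2."""
    parts = urlsplit(url)
    tokens = [t for t in parts.path.split("/") if t][:depth]
    return "/".join([parts.netloc] + tokens)

def aggregate(entries):
    """entries: a per-user, time-ordered list of (url, mime_type, size).
    Embedded objects are attributed to the page URL that precedes them."""
    stats = defaultdict(lambda: {"page_hits": 0, "bytes": 0})
    family = None
    for url, mime_type, size in entries:
        if mime_type.lower() in PAGE_MIME_TYPES:
            family = page_family_of(url)
            stats[family]["page_hits"] += 1
        if family is not None:         # count object volume toward the page
            stats[family]["bytes"] += size
    return dict(stats)
```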
Statistics that are calculated include:
The algorithm can be run in two ways:
It has been shown that web sites sometimes erroneously send non-page items such as GIFs as page elements. As the algorithm takes advantage of the law of large numbers, such behaviors will be trapped by the algorithm's statistical mechanisms. In any case, questionable page elements (for example, elements that arrive nearly 0 time after another page element and, due to low occurrence, are suspect as non-pages) can be isolated in an error group. This group may be inspected from time to time by an automated crawler that tries to fetch the pages to examine whether they are indeed pages or an error on behalf of the web site. A longtail approach can be employed in which only erroneous pages with high occurrence are examined.
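A sketch of such a longtail inspection, assuming the error group is held as a mapping from URL to occurrence count and that suspect URLs are plain HTTP-fetchable:

```python
import urllib.request

PAGE_MIME_TYPES = {"text/html", "text/wml"}

def inspect_error_group(suspect_counts, min_occurrence=100):
    """Longtail inspection: fetch only suspicious 'pages' that occur often
    and check whether the server really serves them as a page MIME type."""
    confirmed_pages, confirmed_errors = [], []
    for url, count in suspect_counts.items():
        if count < min_occurrence:
            continue                  # skip the long tail of rare suspects
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                mime = resp.headers.get_content_type()
        except OSError:
            confirmed_errors.append(url)
            continue
        if mime in PAGE_MIME_TYPES:
            confirmed_pages.append(url)
        else:
            confirmed_errors.append(url)
    return confirmed_pages, confirmed_errors
```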
When the algorithm finishes (for a certain timeframe), it produces a list of page families, each associated with some data. These can be presented using different methods:
The user can use filtering to further adjust the presentation, such as selecting a threshold of page family frequency to be presented. Also, the user can select page families by their URL syntax, for example, page families with /News/ in their tokens.
The following represents an example usage scenario for the algorithm's results:
1. As an example, let us assume that www.provider.com negotiates with operator MyMobile for access to its content by MyMobile users. The provider claims that 1 million pages have been accessed.
2. MyMobile will look at the report on www.provider.com and compare the hits number with the supplier's number. In case inconsistencies arise, the operator can use different page families as validation hooks to try to spot where the inconsistency comes from. For example, he can ask the provider to supply hits information at the level of ‘www.provider.com/sports/footbal/europe’.
3. Further, the operator can look at the linkage list between page families to spot that many links exist between www.provider.com and sports.provider.com, and the missing hits can be associated with browsing at the latter domain.
The ability to identify pages in browsing also lends itself to more complex analysis, such as session analysis in on-portal browsing, where the operator aims to find the most common browsing sessions. For this, frequencies are calculated for movement patterns between page families.
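For example, such a transition-frequency count over per-user page-family sequences might look as follows; the session sequences here are purely illustrative:

```python
from collections import Counter

def transition_frequencies(sessions):
    """Count movements between page families. `sessions` is a list of
    per-user, time-ordered page-family sequences."""
    counts = Counter()
    for families in sessions:
        # Each consecutive pair is one movement between page families.
        counts.update(zip(families, families[1:]))
    return counts

# Illustrative sessions only; real input comes from the page-family analysis.
sessions = [["home", "news", "sports"], ["home", "sports"]]
print(transition_frequencies(sessions).most_common())  # common movement patterns
```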
This algorithm can run as a pre-processing step before the main algorithm run, or as a refining phase once page families have been identified. Further, this approach can be extended to support the full solution to the business problem this patent aims to solve.
Input to the algorithm is the log file, arranged first by MSISDN and then by time.
Ideally, timing information would be provided with ms accuracy, but the algorithm can also manage with an accuracy of just seconds.
During the first pass, the algorithm constructs the following data structures:
From every user to all the visited URLs ordered by time, e.g. as illustrated in
From every URL to all the users who visited this URL:
It should be noted that all information in the URL address after the first “?” is currently dropped. This can be improved by searching for many users who visited the same URL.
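A minimal sketch of this first pass, assuming log rows of (MSISDN, timestamp, URL) already sorted as described:

```python
from collections import defaultdict

def first_pass(log_rows):
    """Build the two first-pass indexes from rows of (msisdn, timestamp, url),
    assumed sorted by MSISDN and then by time."""
    urls_by_user = defaultdict(list)   # user -> visited URLs, ordered by time
    users_by_url = defaultdict(set)    # URL  -> users who visited it
    for msisdn, ts, url in log_rows:
        url = url.split("?", 1)[0]     # drop everything after the first '?'
        urls_by_user[msisdn].append((ts, url))
        users_by_url[url].add(msisdn)
    return urls_by_user, users_by_url
```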
During the second pass, the algorithm picks out the URLs in order of decreasing frequency (i.e. most popular URL first).
Referring to
In the above example, all URLs are within X sec of aURL. URL2 and URL6 are comfortably within the Y seconds limit of the preceding URL, and are included in the neighborhood. URL3 just barely makes the cut. However, URL7 is not within Y sec of URL3 and will therefore not be included in aURL's neighborhood. URL8 will not even be considered, since there was a break in the neighborhood, between URL3 and URL7.
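A sketch of this neighborhood construction for a single user's visit stream; it assumes timestamps in seconds and scans only forward from the anchor URL:

```python
def neighborhood(visits, anchor_idx, x_sec, y_sec):
    """Collect the anchor URL's neighborhood in one user's visit stream.
    `visits` is a time-ordered list of (timestamp_sec, url). A URL joins
    the neighborhood if it lies within x_sec of the anchor AND within
    y_sec of the preceding URL; the first gap (like URL7 above) ends the
    scan, so later URLs (like URL8) are never considered."""
    anchor_t = visits[anchor_idx][0]
    prev_t, hood = anchor_t, []
    for t, url in visits[anchor_idx + 1:]:
        if t - anchor_t > x_sec or t - prev_t > y_sec:
            break                      # break in the neighborhood: stop
        hood.append(url)
        prev_t = t
    return hood
```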
For each nURL, we record the following:
We summarize in a table the expected values for the above variables in two cases: when nURL truly belongs to the same physical page as aURL (Case 1), and when it belongs to a different page (Case 2). These values are for the average case, and therefore only applicable for a large population (i.e., for popular URLs). Individual behavior patterns vary greatly.
The algorithm is fine-tuned using a test sample (where it is known which URLs belong on the same page). This yields a collection of association rules (or any other data mining model, such as decision trees, neural nets, etc.) over the above parameters: a certain set of parameter values would indicate that two URLs are on the same page, whereas a different region of the parameter space would indicate that the URLs belong on different pages.
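A sketch of such fine-tuning using a decision tree, one of the models named above; the feature rows stand in for the recorded per-nURL variables, and all values are purely illustrative:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training rows: one vector of the recorded per-nURL variables
# for each labeled (aURL, nURL) pair in the test sample. Values are made up.
X_train = [[0.3, 12, 0.9], [4.0, 2, 0.1], [0.5, 10, 0.8], [5.5, 1, 0.2]]
y_train = [1, 0, 1, 0]  # 1 = same physical page (Case 1), 0 = different (Case 2)

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# The fitted tree partitions the parameter space into same-page and
# different-page regions, playing the role of the association rules.
print(clf.predict([[0.4, 11, 0.85]]))  # expected: [1] for this toy data
```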
The present invention can be practiced by employing conventional tools, methodology, and components. Accordingly, the details of such tools, components, and methodology are not set forth herein in detail. In the previous descriptions, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it should be recognized that the present invention might be practiced without resorting to the details specifically set forth.
Only exemplary embodiments of the present invention and but a few examples of its versatility are shown and described in the present disclosure. It is to be understood that the present invention is capable of use in various other combinations and environments and is capable of changes or modifications within the scope of the inventive concept as expressed herein.
This application claims the benefit of U.S. Provisional Patent Application No. 61/023,216, filed on Jan. 24, 2008, which is incorporated in its entirety herein by reference.