Embodiments pertain to client side web usage data collection.
To design systems competitively, some original equipment manufacturers (OEMs) use data collected on end-user systems. Browser usage increasingly constitutes a significant part of personal computer usage, so understanding how various types of users use browsers differently may be important for understanding the market segment requirements of personal computers.
Some web services collect raw data on servers including browser cookie tracking, for data-mining on the servers. However, raw browser usage data is private information, and collecting personal computer (PC) users' browsing behavior data in a privacy-preserving and unobtrusive way may be difficult.
Some solutions may be web service-based, requiring raw uniform resource locators (URLs) to be captured between users' requests and websites visited, potentially leaving the user system with a privacy/security risk. Additionally, the web service may log the user's Internet Protocol (IP) address, and the URL may even contain personal information such as a user name. Further, some solutions are intrusive in that they require a browser plugin or network sniffing.
Many secure browsing web services offer only binary classes, e.g., “child-friendly or not,” “malicious or not,” and are geared toward providing specific services to customers, e.g., parental control. Some solutions work for only broad categorization such as a top level URL domain, e.g., www.youtube.com, which may produce little to no useful information.
In embodiments, if a user opts in, a system can collect the user's browsing history and classify entries into high level system impact categories, e.g., using machine learning techniques. The usage by categories may be sent to a server to represent browser usage of system components. In embodiments, the site names do not leave the client system, to prevent URLs selected by the user from becoming public knowledge.
The following set of guidelines may be used in embodiments:
The approach presented herein is capable of classifying a broad range of web site categories by computer system behavior, and may be utilized to determine system component usage for PC designers. Classification may be based on the entire URL, so that most frequently used pages within a domain can be characterized.
Embodiments include machine learning models that can be tuned to any number of categories so as to be appropriate to a privacy sensitivity of each user, addressing common privacy guidelines. For example, specialized user experience studies may make use of machine learning models that correspond to a detailed list of fine-grained categories, e.g., to be applied with users who opt in to a detailed usage collection. On a general usage system, a smaller number of “fuzzier” categories may be used, e.g., resulting in on-client models that may be much smaller and faster. Because cookies are not used in the embodiments presented herein, the models in the embodiments presented would be difficult to co-opt for unintended purposes, e.g., for information gathering such as specific URLs accessed by a user.
Another benefit of the client side decentralized approach is that the overall computation can be treated as massively parallel, in contrast to a web services-based approach where the number of page hits to the web service from all the clients can be huge, potentially requiring an expensive server infrastructure investment.
A first phase 102 is model-building. This is an offline model preparation phase that uses machine learning and text mining. Models generated are able to predict one or more web-categories, given a URL and some page title information.
In an embodiment, phase 102 proceeds as follows:
$W_u = -\log_2\left(R_u / 2N\right)$
$P(Y = c_j) = 1 / \left(1 + e^{-(\beta \ldots)}\right)$
A second phase 110 includes data collection and classification. A low intensity collector in the client system, e.g., a personal computer (PC), gathers web usage data 112 that includes minimal browsing history data (e.g., URLs and page titles) and system utilization, e.g., CPU consumption, by the web sites visited. The history data is then tokenized and passed into a classifier 116 to perform a classification, e.g., determine a corresponding category in which to place each URL. The classifier 116 uses the classification models 114 learned in phase 102 to determine output 118 that includes a quantitative classification of the web site accesses, to be sent to a database 120. The classification suppresses the identity of each website, and instead presents a quantitative measure of website access (e.g., based on website access frequency and website access durations) according to each category.
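The tokenize-then-classify step of the second phase can be illustrated with a minimal sketch. It assumes, for illustration only, that each classification model is a set of per-token weights with a bias term scored through a logistic function; the names `tokenize`, `score`, `classify`, and `MODELS`, and all weight values, are hypothetical and are not taken from the embodiments.

```python
import math
import re
from urllib.parse import urlsplit


def tokenize(url: str, title: str) -> list[str]:
    """Split a URL and its page title into lowercase alphanumeric tokens."""
    parts = urlsplit(url)
    text = " ".join([parts.netloc, parts.path, parts.query, title])
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


# Hypothetical per-category models (assumption): token weights plus a bias.
MODELS = {
    "video": {"bias": -1.0, "weights": {"watch": 2.0, "video": 1.5, "stream": 1.0}},
    "news": {"bias": -1.0, "weights": {"news": 2.0, "article": 1.0, "world": 0.5}},
}


def score(model: dict, tokens: list[str]) -> float:
    """Logistic score of the distinct tokens under one category model."""
    z = model["bias"] + sum(model["weights"].get(tok, 0.0) for tok in set(tokens))
    return 1.0 / (1.0 + math.exp(-z))


def classify(url: str, title: str) -> str:
    """Return the single best-scoring category; the URL itself is not retained."""
    tokens = tokenize(url, title)
    return max(MODELS, key=lambda category: score(MODELS[category], tokens))
```

Only the category label returned by `classify` would be carried forward to later steps in this sketch, which mirrors the requirement that site names do not leave the client system.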
A third phase 130 is server data processing. Anonymous and de-identified information is uploaded to the server from the database 120, e.g., for analysis. The analysis may be used as system use feedback in analytics that may, e.g., influence product improvement of components, design specifications of hardware or software, etc.
The above-described approach includes a trained/learned information transformation algorithm that produces compression of information with intentional loss of precision, while focusing on de-identifying personal information. Categories can be coarse and privacy-preserving. An algorithm may be invoked to automatically prune thousands of fine-grained categories (e.g., retrieved from dmoz.org) into a smaller number of categories. A further refinement process may be invoked to preserve privacy of categories, e.g., through a filter that provides “sanity checks” constructed according to privacy principles, e.g., developed by privacy experts and via user studies. The user studies or surveys can be conducted periodically, e.g., annually, semi-annually, etc., and may be automated. In one embodiment, the final number of categories to be used for classification is between 10 and 100.
In embodiments, classification (e.g., category determination) of URLs happens locally on the user's system, unlike many solutions where the explicit URLs are sent to a web service that potentially exposes the user's IP address and where the web server can store sensitive web usage data.
In embodiments, a non-intrusive, secure collector is used. The collector is neither a plug-in to the browsers, which can make browsers unstable and pose security risks, nor is it a network packet sniffer.
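One possible way to prune and filter categories is sketched below. It is an illustration under stated assumptions, not the pruning algorithm of the embodiments: fine-grained categories are assumed to be slash-separated paths with popularity counts, the privacy filter is assumed to be a simple set of sensitive labels, and the least popular coarse categories are folded into a catch-all `"Other"` category. The function name `prune_categories` and its parameters are hypothetical.

```python
from collections import Counter


def prune_categories(fine_counts, sensitive, max_categories, depth=1):
    """Reduce fine-grained category paths to at most max_categories coarse ones.

    fine_counts: dict mapping a path such as "Arts/Music/Jazz" to a popularity count.
    sensitive: set of labels that the privacy filter excludes (assumption).
    """
    coarse = Counter()
    for path, count in fine_counts.items():
        parts = path.split("/")
        if any(part in sensitive for part in parts):
            continue  # privacy filter: drop the category entirely
        coarse["/".join(parts[:depth])] += count  # truncate to a coarse level

    ranked = [category for category, _ in coarse.most_common()]
    if len(ranked) <= max_categories:
        return sorted(ranked)
    # Keep the most popular categories and combine the rest into one bucket.
    return sorted(ranked[: max_categories - 1] + ["Other"])
```

The `max_categories` parameter corresponds to the specified granularity; a value between 10 and 100 would match the range given above.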
In operation, the collection logic 214 (e.g., hardware, software, firmware, or a combination thereof) may be executed in the core 2121 and upon execution may collect, during a usage period, a history of URLs (optionally including a title on a corresponding title page of each URL) accessed by a user and corresponding elapsed access times. The collection logic 214 can pass the collected history to the classification logic 216, which can classify the URLs according to the classification models 220 (e.g., developed according to the model building described above) that are typically stored in the nonvolatile memory 218. For example, each classification model can indicate, based on URL information received, whether the URL in question falls in the category corresponding to the classification model. Generally, categories are constructed to be non-overlapping. Additionally, the categories are constructed so as to suppress detailed personal preference information, e.g., the URL of each website accessed.
A classification report that is output from the classification logic 216 may include a relative importance of each category determined from the URL access history received, e.g. a numerical value associated with the category for the particular access history being analyzed. The complete classification report (also classification summary, or categorization summary herein) for the particular URL access history typically may include a corresponding value for each category based on, e.g., a count of URLs and access time of each URL. The classification report output suppresses (e.g., omits) the identity of each URL in order to protect privacy of the user. The classification report may be output to server 230.
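A classification report of this kind can be sketched as a per-category aggregation. The sketch below assumes each history entry has already been assigned a category and carries an access duration in seconds; the function name `build_report` and the field names `count`, `seconds`, and `time_share` are hypothetical.

```python
from collections import defaultdict


def build_report(classified_history):
    """Aggregate classified accesses into per-category metrics, omitting URLs.

    classified_history: iterable of dicts with "url", "category", and "seconds".
    """
    counts = defaultdict(int)
    seconds = defaultdict(float)
    for entry in classified_history:
        counts[entry["category"]] += 1              # frequency statistic
        seconds[entry["category"]] += entry["seconds"]  # temporal statistic

    total = sum(seconds.values()) or 1.0  # avoid division by zero on empty input
    # The "url" field is never copied into the report.
    return {
        category: {
            "count": counts[category],
            "seconds": seconds[category],
            "time_share": seconds[category] / total,
        }
        for category in sorted(counts)
    }
```

Because the output is keyed only by category, the identity of each URL is omitted by construction rather than removed after the fact.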
The server 230 may store the classification report. The classification report may be used to determine modification of a future generation of the system 202. For example, the server 230 may collect many classification reports from various users and may analyze the classification reports received to produce an analysis that may point to inferences based on the populations of each of the categories. The analysis may be used as a basis, e.g., in analytics, to implement design changes, e.g., to effect improvement in utility of the system by users.
Referring to
Moving to block 308, a subset of the determined categories may be selected, depending on the granularity specified. Proceeding to block 310, a classification model may be built for each category using L1 regularization, linear regression, etc. Each model is associated with a corresponding category and can provide a quantitative measure of a fit of a URL to the corresponding particular category. The models may be used to determine in which category to place a URL that is logged, e.g., in a URL access summary of a user.
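As an illustration of building one model per category with L1 regularization, the following sketch trains a binary model for a single category. It assumes a logistic model over token presence, trained by proximal gradient descent with soft-thresholding; this particular training procedure is a swapped-in technique chosen for brevity and is not stated in the embodiments. The names `train_l1_logistic`, `predict`, `lam`, `lr`, and `epochs` are hypothetical.

```python
import math


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def train_l1_logistic(samples, labels, vocab, lam=0.01, lr=0.5, epochs=200):
    """Fit a one-category model. samples: list of token sets; labels: 1 or 0."""
    weights = {tok: 0.0 for tok in vocab}
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = {tok: 0.0 for tok in vocab}
        grad_b = 0.0
        for tokens, y in zip(samples, labels):
            z = bias + sum(weights[t] for t in tokens if t in weights)
            err = 1.0 / (1.0 + math.exp(-z)) - y  # gradient of the log loss
            grad_b += err / n
            for t in tokens:
                if t in grad_w:
                    grad_w[t] += err / n
        bias -= lr * grad_b  # the bias is not penalized
        for tok in vocab:
            weights[tok] = soft_threshold(weights[tok] - lr * grad_w[tok], lr * lam)
    return bias, weights


def predict(bias, weights, tokens):
    """Quantitative measure of fit of a token set to the category."""
    z = bias + sum(weights.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))
```

Repeating the training once per selected category, with that category's examples labeled 1 and all others labeled 0, would yield the set of models. The L1 penalty tends to drive weights of uninformative tokens to exactly zero, which keeps on-client models small.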
Continuing to block 504, the server analyzes the plurality of classifications received from the various PCs to determine system usage trends among the participants of the study. Advancing to block 506, the server can use the analysis of the classifications in analytics that can, e.g., provide input to update design requirements of PCs and PC components, improve user experience, etc.
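The server-side analysis can be sketched as a simple aggregation over the classification summaries received. The sketch assumes each summary maps a category to a dict with `count` and `seconds` fields; the function names `aggregate_reports` and `rank_by_time` and the `users` field are hypothetical, and real analytics would likely be more elaborate.

```python
from collections import defaultdict


def aggregate_reports(reports):
    """Combine per-user summaries into population totals per category."""
    totals = defaultdict(lambda: {"count": 0, "seconds": 0.0, "users": 0})
    for report in reports:
        for category, metric in report.items():
            totals[category]["count"] += metric["count"]
            totals[category]["seconds"] += metric["seconds"]
            totals[category]["users"] += 1  # number of users reporting the category
    return {category: dict(values) for category, values in sorted(totals.items())}


def rank_by_time(summary):
    """Order categories by total access time, largest first."""
    return sorted(summary, key=lambda category: summary[category]["seconds"], reverse=True)
```

A ranking of this kind could serve as one input to the usage-trend analysis of block 504; no URL or page title is available to the server at any point in the sketch.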
Referring now to
In turn, the application processor 610 can couple to a user interface/display 620, e.g., a touch screen display. In addition, application processor 610 may couple to a memory system including a non-volatile memory, namely a flash memory 630 and a system memory, namely a dynamic random access memory (DRAM) 635. As further seen, application processor 610 further couples to a capture device 640 such as one or more image capture devices that can record video and/or still images.
Still referring to
As further illustrated, a near field communication (NFC) contactless interface 660 is provided that communicates in a NFC near field via an NFC antenna 665. While separate antennae are shown in
To enable communications to be transmitted and received, various circuitry may be coupled between baseband processor 605 and an antenna 690. Specifically, a radio frequency (RF) transceiver 670 and a wireless local area network (WLAN) transceiver 675 may be present. In general, RF transceiver 670 may be used to receive and transmit wireless data and calls according to a given wireless communication protocol, such as a 3G or 4G protocol in accordance with code division multiple access (CDMA), global system for mobile communication (GSM), long term evolution (LTE), or another protocol. In addition, a GPS sensor 680 may be present. Other wireless communications, such as receipt or transmission of radio signals, e.g., AM/FM and other signals, may also be provided. In addition, via WLAN transceiver 675, local wireless communications can also be realized.
Additional embodiments are described below.
A first embodiment is a system that includes a processor including at least a first core that includes collection logic to record a history of website accesses of a plurality of websites by a user. The processor also includes classification logic to assign the website accesses to corresponding categories by application of a plurality of models, where each model corresponds to a respective category, and to determine a classification summary that includes a plurality of category metrics, each category metric associated with the respective category, each category metric based on a corresponding measure of the website accesses within the respective category, where the classification summary suppresses a corresponding identity of each website accessed. The system also includes a nonvolatile memory coupled to the processor.
A 2nd embodiment includes elements of the 1st embodiment, where the nonvolatile memory is to store a representation of each of the plurality of models.
A 3rd embodiment includes elements of the 1st embodiment, where each category metric is to include a respective frequency statistic that is based on a count of the website accesses of the websites assigned to the corresponding category during a determined time period.
A 4th embodiment includes elements of the 1st embodiment. Additionally, each category metric is to include a respective temporal statistic that is based on a cumulative time duration of the website accesses of the websites assigned to the corresponding category during a determined time period.
A 5th embodiment includes elements of the 1st embodiment, where a category count of the categories is less than approximately 100.
A 6th embodiment includes elements of any one of embodiments 1-5, where each category corresponds to a unique set of websites and each website is to be included in a single corresponding category.
A 7th embodiment is a method that includes gathering, by a server, website identification data of a plurality of websites and corresponding popularity data; determining by the server an initial set of categories based on the website identification data and the corresponding popularity data; applying a category reduction filter to the initial set of categories to exclude a subset of categories that corresponds to private information of a user that is to access websites via a user system, to produce a reduced set of categories; constructing a final set of categories from the modified set of categories according to a specified count of categories in the final set of categories; building a plurality of models, each model associated with a corresponding category of the final set of categories, each model to provide a quantitative measure of a fit of a particular website for inclusion in the corresponding category; and providing a classification tool to the user system, where the classification tool includes the plurality of models and the final set of categories, where each model is identified with its corresponding category.
An 8th embodiment includes elements of the 7th embodiment, where constructing the final set of categories includes combining two or more categories of the modified set of categories to reduce a count of distinct categories to be included in the final set of categories.
A 9th embodiment includes elements of the 7th embodiment, where building the models includes applying training data to the final set of categories using one or more machine learning techniques.
A 10th embodiment includes elements of the 9th embodiment, where each model is formed based at least in part on uniform resource locators (URLs) and corresponding page titles of the training data.
An 11th embodiment includes elements of the 7th embodiment, and further includes periodically updating the classification tool by repeating gathering the website data, determining the initial set of categories, applying the category reduction filter, constructing the final set of categories, and forming the plurality of models.
A 12th embodiment includes elements of the 7th embodiment, where periodically updating the classification tool further comprises periodically updating the category reduction filter.
A 13th embodiment includes elements of the 7th embodiment, where at least some of the categories in the final set of categories pertain to system usage of the user system.
A 14th embodiment includes elements of the 7th embodiment, where the classification tool is to output a classification summary that includes a measure of website accesses for each category of the final set of categories.
A 15th embodiment includes elements of the 14th embodiment, where the classification summary is to suppress an identity of each uniform resource locator (URL) of each website represented within a particular category.
A 16th embodiment includes elements of any one of the 7th to the 15th embodiments, and further includes constructing the category reduction filter based on expert input received from at least one expert source.
A 17th embodiment is a machine readable medium having stored thereon instructions, which if performed by a machine cause the machine to perform a method that includes receiving, by a server from each of a plurality of user systems, a respective classification summary that includes, for each category of a set of categories, a category metric that includes a frequency statistic including a measure of website accesses of websites assigned to the category during a defined time period, where the classification summary is to suppress a corresponding identity of each of the websites assigned to each category; performing an analysis of the classification summary received; and determining modifications of user system design requirements based at least in part on the analysis.
An 18th embodiment includes elements of the 17th embodiment, where at least some of the categories of the set of categories pertain to system usage of each user system from which the classification summaries are received.
A 19th embodiment includes elements of the 17th embodiment, where suppression of the corresponding identity of each of the websites assigned to each category includes prevention of determination of a corresponding uniform resource locator (URL) and a corresponding page title of each of the websites reflected in the classification summary.
A 20th embodiment includes elements of any one of the 17th to the 19th embodiments, where each category metric further includes a time duration statistic determined based on a sum of time durations of access, during the defined time period, of each of the websites within the corresponding category.
A 21st embodiment is a method that includes receiving, by a server from each of a plurality of user systems, a respective classification summary that includes, for each category of a set of categories, a category metric that includes a frequency statistic including a measure of website accesses of websites assigned to the category during a defined time period, where the classification summary is to suppress a corresponding identity of each of the websites assigned to each category; performing an analysis of the classification summary received; and determining modifications of user system design requirements based at least in part on the analysis.
A 22nd embodiment includes elements of the 21st embodiment, where at least some of the categories of the set of categories pertain to system usage of each user system from which the classification summaries are received.
A 23rd embodiment includes elements of the 21st embodiment, where suppression of the corresponding identity of each of the websites assigned to each category is to prevent determination of a corresponding uniform resource locator (URL) and a corresponding page title of each of the websites reflected in the classification summary.
A 24th embodiment includes elements of any one of the 21st to the 23rd embodiments, where each category metric further includes a time duration statistic determined based on a sum of time durations of access, during the defined time period, of each of the websites within the corresponding category.
A 25th embodiment is a system that includes a server including at least one processor to: receive from each of a plurality of user systems, a respective classification summary that includes, for each category of a set of categories, a category metric that includes a frequency statistic including a measure of website accesses of websites assigned to the category during a defined time period, where the classification summary is to suppress a corresponding identity of each of the websites assigned to each category; perform an analysis of the classification summary received; and recommend modifications of user system design requirements based at least in part on the analysis.
A 26th embodiment includes elements of the 25th embodiment, where at least some of the categories of the set of categories pertain to system usage of each user system from which the classification summaries are received.
A 27th embodiment includes elements of the 25th embodiment, where suppression of the corresponding identity of each of the websites assigned to each category is to prevent determination of a corresponding uniform resource locator (URL) and a corresponding page title of each of the websites reflected in the classification summary.
A 28th embodiment includes elements of any one of embodiments 25-27, where each category metric further includes a time duration statistic determined based on a sum of time durations of access, during the defined time period, of each of the websites within the corresponding category.
A 29th embodiment is a method that includes recording a history of website accesses of a plurality of websites by a user; assigning the website accesses to corresponding categories by application of a plurality of models, where each model corresponds to a respective category; and determining a classification summary that includes a plurality of category metrics, each category metric associated with the respective category, each category metric based on a corresponding measure of the website accesses within the respective category, where the classification summary suppresses a corresponding identity of each website accessed.
A 30th embodiment includes elements of the 29th embodiment, where each category metric is to include a respective frequency statistic that is based on a count of the website accesses of the websites assigned to the corresponding category during a determined time period.
A 31st embodiment includes elements of the 29th embodiment, where each category metric is to include a respective temporal statistic that is based on a cumulative time duration of the website accesses of the websites assigned to the corresponding category during a determined time period.
A 32nd embodiment includes elements of the 29th embodiment, where a category count of the categories is less than approximately 100.
A 33rd embodiment includes elements of any one of embodiments 29-32, where each category corresponds to a unique set of websites and each website is to be included in a single corresponding category.
Embodiments may be used in many different types of systems. For example, in one embodiment a communication device can be arranged to perform the various methods and techniques described herein. Of course, the scope of the present invention is not limited to a communication device, and instead other embodiments can be directed to other types of apparatus for processing instructions, or one or more machine readable media including instructions that in response to being executed on a computing device, cause the device to carry out one or more of the methods and techniques described herein.
Embodiments may be implemented in code and may be stored on a non-transitory storage medium having stored thereon instructions which can be used to program a system to perform the instructions. Embodiments also may be implemented in data and may be stored on a non-transitory storage medium, which if used by at least one machine, causes the at least one machine to fabricate at least one integrated circuit to perform one or more operations. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, solid state drives (SSDs), compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of the present invention.