This application relates to data-analysis tools and techniques, which may be used, for example, to analyze a user's activities at their computer workstation.
With the advent of distributed computer networks connecting multiple users to large databases of information, the personal computer has emerged as an important research and informational tool. The largest such network, commonly known as the Internet, has given computer users unprecedented access to a seemingly limitless amount of information (including government records, publication databases, and other information sources). In certain situations, however, it is desirable to monitor a user's activity on such networks as well as the user's other activities at their workstation. For example, it may be desirable for some business owners to monitor the activities of their employees as they work at their workstations (e.g., to better understand their employees' analytical processes or to account for computer and Internet activity). Simply recording what events occur at a user's workstation (i.e., recording the low-level, user activity such as keystrokes, mouse actions, file accesses, and network-access requests), however, can produce an enormous amount of user-activity data that does not offer much insight into what the user was actually intending to do. Accordingly, techniques and tools that help analyze low-level, user-activity data and extract targeted information indicative of what the user intended to do are desirable.
Disclosed below are representative embodiments of methods, apparatus, and systems for analyzing user-activity data. The disclosed methods, apparatus, and systems should not be construed as limiting in any way. Instead, the present disclosure is directed toward all novel and non-obvious features and aspects of the various disclosed embodiments and their equivalents, alone and in various combinations and sub-combinations with one another. Further, the disclosed methods, apparatus, and systems are not limited to any specific aspect, feature, or combination thereof, nor do the disclosed methods, apparatus, or systems require that any one or more specific advantages be present or problems be solved.
In one disclosed embodiment, user-activity data is received. The user-activity data of this embodiment comprises one or more network-access requests (e.g., uniform-resource-locator (URL) addresses accessed by the computer workstation). A selected network-access request from the user-activity data (e.g., a network-access request that is determined to be responsive to an immediately prior user-interface event, such as a keystroke or mouse event) is compared to one or more known non-user-initiated network-access requests. The selected network-access request is designated as being a user-initiated network-access request based at least in part on the comparison. A list of targeted user activities comprising at least the designated user-initiated network-access request can be output. In certain implementations, the act of comparing includes determining that the selected network-access request does not match any of the known non-user-initiated network-access requests. In some implementations, the known non-user-initiated network-access requests are stored in one or more lists of known non-user-initiated network-access requests. These lists might comprise, for example, URL addresses known to be secondary URL addresses or URL addresses known to be of a non-primary type. The selected network-access request may be a first network-access request, and the method may further comprise identifying a second selected network-access request as being a non-user-initiated network-access request from the user-activity data. One of the lists of non-user-initiated network-access requests may then be updated to include the non-user-initiated network-access request identified. In such implementations, the method may further comprise determining that the second selected network-access request does not immediately follow a user-interface event.
In another disclosed embodiment, data indicating activity at a computer workstation is received. In this embodiment, the data comprises entries indicative of network-access requests from the computer workstation (e.g., URL addresses). The network-access requests comprise both user-initiated network-access requests and non-user-initiated network-access requests. One or more of the network-access requests are designated as user-initiated network-access requests via the data indicating activity at the computer workstation. In certain implementations, the user-activity data additionally comprises entries indicative of user-interface events, and the method includes identifying at least one of the user-interface events as a user-interface event initiating at least one of the network-access requests. The act of designating one or more of the network-access requests as user-initiated network-access requests may, in some implementations, comprise searching one or more lists of non-user-initiated network-access requests. Additionally, one or more of the network-access requests may be identified as non-user-initiated network-access requests via the data. One or more search queries may also be identified from the network-access requests. The method can further comprise updating a list of non-user-initiated network-access requests with one or more of the network-access requests identified as non-user-initiated network-access requests.
In another disclosed embodiment, user-activity data is received. In this embodiment, the user-activity data comprises one or more network-access requests (e.g., URL addresses). A selected network-access request from the user-activity data is compared to known search-engine-query addresses. By matching the selected network-access request to one of the known search-engine-query addresses, the selected network-access request is identified as being a search-engine query. A user query to the search engine may also be identified from the selected network-access request. The method may further comprise outputting a list of targeted user activities, wherein the list of targeted user activities comprises at least the search-engine query identified. In certain implementations, the known search-engine-query addresses comprise URL addresses for known Internet search engines.
In another disclosed embodiment, user-activity data is received. In this embodiment, the user-activity data comprises one or more file-activity events, wherein each file-activity event is indicative of a respective file that was accessed by a computer workstation and a process that accessed the respective file on the computer workstation. Two or more of the file-activity events are clustered together. In this embodiment, the clustered file-activity events involve a common process accessing a common file within respective time intervals from one another. The clustered file-activity events are classified as being representative of a targeted file action. In certain implementations, the act of classifying comprises comparing a time associated with the clustered file-activity events to a creation time and a modification time of the common file, and designating the clustered file-activity events as representing a creation, a modification, or an opening of the common file based at least in part on the comparison. The acts of comparing and designating may be performed for the clustered file-activity events only after the clustering is determined to be complete. In some implementations, the method also comprises deleting a selected file-activity event from the user-activity data if the selected file-activity event indicates access to a file on a list of excluded files (e.g., a list comprising temporary files). A list of targeted user activities comprising at least the targeted file action represented by the clustered file-activity events can be output. In certain implementations, the acts of clustering and classifying are performed substantially as the user-activity data is received.
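The clustering and classifying acts of this embodiment can be illustrated with a minimal Python sketch. The record fields, the two-second gap and tolerance values, and the particular rule used to choose among "created," "modified," and "opened" are assumptions made for illustration only and are not prescribed by this disclosure.

```python
from dataclasses import dataclass


@dataclass
class FileEvent:
    time: float    # time stamp in seconds (hypothetical representation)
    process: str   # process that accessed the file
    path: str      # file that was accessed


def cluster_events(events, max_gap=2.0, excluded=()):
    """Group events sharing a process and a file whose time stamps lie
    within max_gap seconds of the previous event in the same cluster."""
    clusters = []
    latest = {}  # (process, path) -> most recent cluster for that pair
    for ev in sorted(events, key=lambda e: e.time):
        if ev.path in excluded:      # e.g., temporary files
            continue
        key = (ev.process, ev.path)
        current = latest.get(key)
        if current is not None and ev.time - current[-1].time <= max_gap:
            current.append(ev)
        else:
            current = [ev]
            latest[key] = current
            clusters.append(current)
    return clusters


def classify(cluster, created_at, modified_at, tolerance=2.0):
    """Assumed rule: compare the cluster's time span to the file's
    creation and modification times once clustering is complete."""
    start = cluster[0].time - tolerance
    end = cluster[-1].time + tolerance
    if start <= created_at <= end:
        return "created"
    if start <= modified_at <= end:
        return "modified"
    return "opened"
```

In this sketch, classification is deferred until all events have been grouped, which corresponds to performing the comparing and designating acts only after the clustering is determined to be complete.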
In another disclosed embodiment, network-access requests from a computer workstation and network responses to the network-access requests are monitored. A network response is identified that directs the computer workstation to perform a window title change (e.g., a network response comprising an HTML directive to change window titles). The identified network response is received in response to a corresponding network-access request (e.g., a network-access request comprising a URL address). A determination is made that a window on the computer workstation changed as a result of the identified network response, and the window is associated with the corresponding network-access request. In some implementations, the act of determining comprises evaluating whether the window on the computer workstation changed titles within a predetermined period of time of the identified network response and whether a new title of the window matches a title directed by the identified network response. The method may further comprise displaying the corresponding network-access request to a user when the associated window is active. In certain implementations, the acts of identifying, determining, and associating are performed substantially concurrently with the monitoring.
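The determining act of this embodiment can be sketched as follows. The dictionary layout, the five-second period, the use of an HTML title element as the directive, and the substring test for a title "match" (browsers commonly append their own name to a page title) are all illustrative assumptions.

```python
import re

# Assumed form of the directive: an HTML <title> element in the response.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def directed_title(html):
    """Return the title that the response directs the window to show."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def associate_window(response, title_changes, max_delay=5.0):
    """Return the id of the window whose title changed because of the
    response, or None if no title change qualifies.

    response: {"url": str, "time": float, "html": str}
    title_changes: [{"window_id": int, "time": float, "new_title": str}]
    """
    title = directed_title(response["html"])
    if title is None:
        return None
    for change in title_changes:
        delay = change["time"] - response["time"]
        # Both tests from the text: timing and title agreement.
        if 0 <= delay <= max_delay and title in change["new_title"]:
            return change["window_id"]
    return None
```

A caller could then record the pair (window id, response URL) so that the URL can be displayed whenever that window is active.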
In another embodiment, a method for analyzing user-activity data is disclosed. In this embodiment, two or more data streams of low-level, user-activity data are detected at a computer workstation via two or more respective sensors. In this embodiment, the two or more respective sensors comprise at least a first sensor configured to detect network-access requests and a second sensor configured to detect at least one of the following: file-activity events, window-title-change events, or user-interface events. Targeted user activity is identified from at least one of the data streams. The targeted user activity is stored, whereas the remainder of the data stream from which it was identified is disregarded. In some implementations, the targeted user activity is identified using a combination of at least two of the data streams. The targeted user activity can comprise, for example, a user initiating a network access; performing a search on a search engine; creating, opening, or modifying a file; or initiating a network access that causes a window title to change. In certain implementations, the act of identifying the targeted user activity is performed substantially as a corresponding data stream is received. In some implementations, the targeted user activity is displayed via a graphical user interface and/or stored in a list of targeted user activity on one or more computer-readable media.
Any of the disclosed methods may be implemented as computer-readable media comprising computer-executable instructions for causing a computer to perform the method. Further, computer-readable media comprising lists at least partially created or modified by the disclosed methods are also provided. The disclosed embodiments may also be implemented (partially or completely) in hardware (e.g., one or more integrated circuits).
The foregoing and additional features and advantages of the disclosed embodiments will become more apparent from the following detailed description, which proceeds with reference to the following drawings.
Disclosed below are representative embodiments of methods, apparatus, and systems for analyzing user-activity data (e.g., a user's activity at a computer workstation). The disclosed methods may be used, for example, in software or hardware tools (or combinations thereof) that detect, record, analyze, and/or display user-activity data.
The disclosed methods, apparatus, and systems should not be construed as limiting in any way. Instead, the present disclosure is directed toward novel and non-obvious features and aspects of the various disclosed embodiments and their equivalents, alone and in various combinations and sub-combinations with one another. Moreover, the methods, apparatus, and systems are not limited to any specific aspect or feature, or combination thereof, nor do the disclosed methods, apparatus, and systems require that any one or more specific advantages be present or problems be solved.
Although the operations of some of the disclosed methods, apparatus, and systems are described in a particular, sequential order for convenient presentation, it should be understood that this manner of description encompasses rearrangement, unless a particular ordering is required by specific language set forth below. For example, operations described sequentially may in some cases be rearranged or performed concurrently. Moreover, for the sake of simplicity, the attached figures may not show the various ways in which the disclosed methods, apparatus, and systems can be used in conjunction with other methods, apparatus, and systems. Additionally, the description sometimes uses terms like “determine” and “identify” to describe the disclosed methods. These terms are high-level abstractions of the operations that are performed. The operations that correspond to these terms will vary depending on the particular implementation and are readily discernible by one of ordinary skill in the art.
The disclosed embodiments can be implemented in a wide variety of environments. For example, any of the disclosed techniques can be implemented in software comprising computer-executable instructions stored on a computer-readable medium. Such software can comprise, for example, monitoring or instrumenting software used to capture and record user activities on a multi-user and/or networked computer system. Such software can be executed on a single computer or on a networked computer (e.g., via the Internet, a wide-area network, a local-area network, a client-server network, or other such network). For clarity, only certain selected aspects of the software-based implementations are described. Other details that are well known in the art are omitted. For example, it should be understood that the disclosed technology is not limited to any specific computer language, program, or computer. For the same reason, computer hardware is not described in further detail. Any of the disclosed methods can alternatively be implemented (partially or completely) in hardware (e.g., on a system-on-a-chip (SoC), application-specific integrated circuit (ASIC), or programmable logic device (PLD), such as a field programmable gate array (FPGA)).
The disclosed technology is generally applicable to any field in which it is desirable to record and analyze a user's activities (e.g., commercial businesses monitoring their employees or Website, parents monitoring the computer and Internet activity of their children, non-commercial research, intelligence analysis, and other such fields).
In order to gather as much information as possible about a user's work, it is desirable to capture and record the user's activities on their workstation 102. With reference to
A proxy server is used in certain embodiments of the disclosed technology because most computer systems (for example, computers using the Microsoft® Windows® or Unix® operating systems) do not provide explicit “hooks” that monitoring software can use to detect and record network-access requests. In other embodiments, however, network-access requests are detected and recorded without using a proxy server. For example, depending on the configuration of the user's computer system, application-program-interface (API) hooking (e.g., the Microsoft® Detours® package) can be used to obtain explicit notifications of network-access requests performed by a user's browser. Although some browsers provide their own APIs, this technique typically requires detailed knowledge of the browser's internal operation, which may be different for each browser and is subject to change without notice. It can therefore be difficult to obtain the desired information directly and unambiguously from the user's computer system. Accordingly, it is often more practical, though not necessary, to use a proxy server inserted into the communication path between the user's browser and the network (e.g., the Web).
The one or more sensors 106 can additionally or alternatively comprise file- and/or operating-system monitors adapted, for example, to record files or applications accessed by a user (referred to herein as “file-activity events”), windows opened or closed by the user, and other such operational data. The one or more sensors 106 can additionally or alternatively comprise monitors adapted to detect the user's keystrokes (e.g., depressions and releases of keys) at the workstation 102 and/or to receive at least some of the user's pointer-device actions (e.g., depressions and releases of mouse buttons). Keystrokes and pointer-device actions are collectively referred to herein as “user-interface events.” This term is not limited, however, and may include other user-initiated actions associated with input/output devices of the workstation 102 (e.g., spoken commands). To detect user-interface events, file-activity events, and window-related events, system-wide “hooking” (e.g., Windows® system-wide hooking) may be utilized. In some situations, a proxy server configured to record file-activity events between a user's workstation and a network server may also be used.
Monitoring software that is run on the user's workstation 102 (or on a connected monitoring computer) can be adapted to receive the output of the one or more sensors 106 and to create one or more lists of user activity. As used herein, the term “list” refers to a collection or arrangement of data that is usable by a computer system. A list may be, for example, a data structure or combination of data structures (such as a queue, stack, array, linked list, heap, or tree) that organizes data for better processing efficiency, or any other structured logical or physical representation of data in a computer system or computer-readable media (such as a table used in a relational database). Moreover, any of the lists discussed herein may be persistent (that is, the list may be stored in computer-readable media such that it is available beyond the execution of the application creating and using the list) or non-persistent (that is, the list may be only temporarily stored in computer-readable media such that it is cleared when the application creating and using the list is closed or when the list is no longer needed by the application).
In one exemplary configuration, the monitoring software captures and records network-access requests (e.g., the URL addresses associated with a Web access), user-interface events, window events (create, destroy, title, activate, etc.), and file-activity events into separate respective lists. These lists may be analyzed separately and afterwards combined into a single list or database of targeted user activities for convenient presentation to the user. In certain embodiments, the monitoring software is further adapted to allow the user to manually enter information about their activities. For example, the user may be able to create entries for non-workstation activities that cannot be recorded automatically by the monitoring software (e.g., meetings with other analysts or non-computer research). The monitoring software may also allow the user to insert explanatory notes regarding any of the user's activities.
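As one illustration of combining the separate lists, the following Python sketch merges several time-ordered lists into a single chronological list. The tuple layout (time stamp, kind, detail) is a hypothetical representation, and each input list is assumed to be already sorted by time stamp.

```python
import heapq


def merge_streams(*streams):
    """Merge time-ordered lists of (time, kind, detail) entries into
    one chronological list without re-sorting the individual lists."""
    return list(heapq.merge(*streams, key=lambda entry: entry[0]))


# Example with two hypothetical lists:
urls = [(1.0, "url", "http://www.cnn.com"), (3.0, "url", "http://www.cnn.com/logo.gif")]
keys = [(2.0, "key", "enter")]
merged = merge_streams(urls, keys)
```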
The user activity that is initially detected by the sensors typically comprises one or more raw data streams containing a large amount of irrelevant data. The various entries in the data streams can be time stamped in order to allow the recreation of various actions and responses. The precision with which the data streams are time stamped and/or combined may vary from implementation to implementation, possibly affecting the reliability with which embodiments of the disclosed heuristics operate. Typically, however, relatively precise time stamping is desired (e.g., within a hundredth of a second or within a thousandth of a second). An example of raw data as may be received by the sensors is shown in
As can be seen from
As shown at process block 310 in
As shown at process block 312 in
At process block 306, the data is stored. The data stored can comprise, for example, the targeted data identified by the heuristics at process block 304 as well as data not analyzed at process block 304. For example, in some embodiments, it might be desirable to apply certain heuristics as the relevant data is being received, whereas other heuristics are desirably applied at a later time and possibly by another computer system. The data may be stored in separate lists of user-activity or as lists comprising various combinations and sub-combinations of user-activity data (such as table 200). Further, the data may be transferred to a server computer or transportable computer-readable media such that it can be analyzed later.
Turning to
The concept of user-initiated network-access requests can be described in the context of a user browsing the World Wide Web. In this context, a user-initiated network-access request occurs, for example, when the user affirmatively selects to visit a particular Website, say “http://www.cnn.com,” on their Web browser (e.g., by typing the URL address into the browser's address bar and clicking “go” or “enter,” selecting a web page from a “favorites” or “file history” menu, or clicking on a hyperlink or shortcut embedded in a web page or email). This original user-initiated access to www.cnn.com is the event that is desirably identified as “user-initiated.” When a browser visits “www.cnn.com,” however, many other “secondary” URLs are accessed automatically on the user's behalf (e.g., to load images, advertisements, article titles, etc.). For example, one visit to “www.cnn.com” can result in the browser accessing over eighty secondary URLs in addition to the “primary” URL: http://www.cnn.com. Secondary URLs are typically contained in the HTML text sent when the primary URL is accessed and are desirably identified as “non-user-initiated” network-access requests by the heuristic at process block 360. Exemplary embodiments of such heuristics are discussed in greater detail below.
As shown at process block 362, a heuristic can also be applied to the user-activity data in order to identify search queries entered by a user. In one particular embodiment, the heuristic can be adapted to identify search queries made by a user to a search engine on the Web. Thus, if the user searches for the term “United States Patent and Trademark Office” on their Web browser using the Google® search engine, the heuristic can determine not only that a search was made using the Google® search engine, but can identify the specific terms searched. Exemplary embodiments of such heuristics are also described below.
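A minimal sketch of such a search-query heuristic is shown below. The table of known search-engine-query addresses, including the path and the query-parameter name listed for each host, is a hypothetical example and would in practice be populated from empirical information.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical table: host -> (query path, name of the parameter
# that carries the user's search terms).
KNOWN_SEARCH_ENGINES = {
    "www.google.com": ("/search", "q"),
}


def extract_search_query(url):
    """Return (host, search terms) if the URL matches a known
    search-engine-query address; otherwise return None."""
    parts = urlsplit(url)
    entry = KNOWN_SEARCH_ENGINES.get(parts.hostname)
    if entry is None:
        return None
    path, param = entry
    if parts.path != path:
        return None
    values = parse_qs(parts.query).get(param)
    if not values:
        return None
    return parts.hostname, values[0]
```

In this sketch, matching the host and path identifies the network-access request as a search-engine query, and decoding the named parameter recovers the specific terms searched.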
At process block 356, the targeted user activities are output. For example, in certain implementations, the targeted user activities are merged into a single list of targeted user activities that can be output to the user (e.g., via a graphical user interface) or stored in non-volatile computer-readable media. The list of targeted user activities can be created using any combination of targeted user activities identified by the one or more heuristics applied at process blocks 304 and 354. For example, in the context of monitoring a user's Web-browser activity, the list may comprise the primary URLs accessed by the user and/or queries made by the user to Internet search engines. The list may further comprise additional entries corresponding to other targeted user activities, such as targeted file actions (e.g., files opened, modified, or created by the user), user-interface events, and window-change events.
In some embodiments of the disclosed technology, the list of targeted user activities created at process block 306 is stored only temporarily (e.g., in the volatile memory of a computer system or in some other temporary computer-readable media) and thus does not persist once the computer application implementing the method 300 stops running. In other embodiments, however, the list of targeted user activities is stored in non-volatile memory or in some other persistent computer-readable media.
As can be seen from
The particular manner of presentation shown in
The number, sequence, and purpose of the heuristics shown in
In this section, embodiments of heuristics as may be applied in the general method 300 outlined above are described in greater detail. As noted above, the heuristics are not necessarily limited to the order shown in
One exemplary type of heuristic that can be used in the general method 300 shown in
In the context of a user operating a Web browser, for example, there are numerous ways that a user can initiate a Web access. For example, a Web access to a primary URL address can be initiated by a user by: (1) typing the desired URL address into the browser address bar and clicking the “go” button; (2) typing the desired URL address into the browser address bar and hitting “enter”; (3) selecting File|Open from the menu bar of the browser, typing the desired URL address, and clicking “OK”; (4) selecting File|Open|Browse, navigating to a shortcut that contains the desired URL address, and double-clicking it; (5) clicking a hyperlink to the desired URL address in a currently displayed Web page; (6) clicking an “OK” button on a Web page that initiates a hyperlink to a desired URL; (7) clicking a hyperlink to the desired URL address embedded in an email message; or (8) selecting a URL address from a “favorites” or “file history” menu.
In one exemplary implementation, the following simple heuristic can be used for identifying a user-initiated network-access request: “the first network access following a keystroke or mouse click represents a user-initiated network-access request.” This simple heuristic may fail in many different circumstances. For example, in the context of a user browsing the Web, the heuristic will fail when the user: (1) posts a request against a search engine at site S; (2) clicks on one of the hits that is returned to visit site A; (3) clicks the browser's “back” button to review the hit list; and (4) clicks on another hit to visit site B. When the “back” button is pressed, the browser will often reload many URL addresses associated with the search page, but not the primary URL of the search page S itself, as this primary URL is often cached internally by the browser. The first network access following the “back” click is then a secondary URL address, which the simple heuristic incorrectly labels as user-initiated.
This simple heuristic may also fail if the user's Web connection is slow. When the user initiates a request to site A, the user may, while that page is loading, switch to a different window and type into a word processor. The user's keyboard input may then be interleaved with numerous Web-access requests being performed by the browser, thus resulting in spurious instances of Web-access requests being labeled as “user initiated,” when in fact they were not.
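The simple heuristic and its slow-connection failure can be illustrated with a short Python sketch. The event tuples and the example URL addresses other than http://www.cnn.com are hypothetical.

```python
def simple_heuristic(events):
    """Label the first network access after any keystroke or mouse
    click as user-initiated. events: (time, kind, detail) tuples."""
    user_initiated = []
    armed = False
    for _time, kind, detail in sorted(events, key=lambda e: e[0]):
        if kind in ("key", "mouse"):
            armed = True
        elif kind == "url" and armed:
            user_initiated.append(detail)
            armed = False
    return user_initiated


# The user clicks a link, then types in a word processor while the
# page is still loading over a slow connection.
events = [
    (0.0, "mouse", "click link"),
    (0.1, "url", "http://www.cnn.com"),
    (0.4, "url", "http://www.cnn.com/logo.gif"),
    (0.9, "key", "t"),                           # typed in another window
    (1.2, "url", "http://www.cnn.com/ad.js"),    # secondary URL
]
labeled = simple_heuristic(events)
```

Here the list returned contains both the primary URL address and the secondary address requested after the unrelated keystroke, which is the spurious labeling described above.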
At process block 502, user-activity data is received. In this embodiment, it is assumed that the user-activity data received comprises network-access requests (e.g., Web-access requests), user-interface events (e.g., keystroke and mouse actions), and the corresponding times at which these events occurred.
At process block 504, a network-access request that immediately follows a user-interface event is identified. This network-access request may be identified, for example, by ordering the user-activity data chronologically and identifying a network-access request that is immediately subsequent to a user-interface event.
At process block 506, the identified network-access request is compared to network-access requests that are known to be non-user-initiated. The known non-user-initiated network-access requests may be stored in one or more lists. For example, in the context of a user browsing the Web, the one or more lists of non-user-initiated network-access requests may comprise a list of known secondary URL addresses created from empirical information. The list may be updated continuously or periodically with additional non-user-initiated network-access requests. For example, and as explained more fully below, the list can be updated using entries from the user-activity data that are determined to be non-user-initiated. In this way, the list of known non-user-initiated network-access requests grows as the heuristic is being applied.
It should be noted that it is possible for a URL address that is typically a secondary URL address to be used as a primary URL address (e.g., by inserting the secondary URL address into the address bar of a Web browser and clicking the “go” button or hitting “enter”). Such usage, however, is not typical and is not accounted for in the illustrated embodiments. The embodiments may, however, be modified to account for such behavior.
The one or more lists of non-user-initiated network-access requests may also comprise a list of network-access-request types known to be non-user-initiated. In the context of a user browsing the Web, for example, there exist certain URL-address types that are generally known to be non-primary (e.g., a URL-address type not designed to be the first URL address accessed by a Web browser when loading a Web page). For example, URL addresses with extensions such as “.js” (for JavaScript) or “.css” (for Cascading Style Sheet) are of a non-primary type. Thus, any URL address containing a “.js” or “.css” extension can be identified as a URL address of a non-primary type. A URL address may contain other information that identifies it as being of a non-primary type. For example, URL addresses to a particular ad server might be designated as being of a non-primary type and included in the list. The list of network-access-request types known to be non-user-initiated typically comprises various network-access-request patterns (which may include one or more wildcard characters) tailored to identify the presence of the targeted information in a network-access request (e.g., “*/*.css/*” where the “*” represents a wildcard character).
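Wildcard matching against a list of non-primary URL-address types can be sketched as follows. The patterns shown, including the ad-server host name, are hypothetical, and stripping the query string before matching is a design choice of this sketch rather than a requirement of the disclosure.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical patterns; "*" matches any run of characters.
NON_PRIMARY_PATTERNS = [
    "*.js",                      # JavaScript files
    "*.css",                     # Cascading Style Sheets
    "*://ads.example.com/*",     # a particular (made-up) ad server
]


def is_non_primary_type(url):
    """Return True if the URL matches any non-primary pattern."""
    parts = urlsplit(url)
    # Drop the query string and fragment so "main.js?v=3" still matches.
    bare = f"{parts.scheme}://{parts.netloc}{parts.path}".lower()
    return any(fnmatchcase(bare, pattern) for pattern in NON_PRIMARY_PATTERNS)
```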
Returning to
At process block 602, user-activity data that corresponds to a user's activities at their workstation over a specific period of time is received. In this embodiment, the user-activity data comprises Web-access requests (primary and secondary URL addresses accessed by the user's browser), user-interface events (low-level keystroke and mouse-action data from the user's workstation), window events (changes of the active window), and time data as to when each event occurred. The user-activity data received is sorted into chronological order using the time data. In one exemplary implementation, for example, the user-activity data is sorted chronologically from earliest to latest.
At process block 603, an indicator flag (termed the “may-be-primary-URL” flag in
At process block 604, the next entry is selected from the chronologically sorted user-activity data.
At process block 606, a determination is made as to whether the selected entry is a “key-down” or a “mouse-up” event directed to a Web browser. This determination can be made, for example, using the window-event information recorded as part of the user-activity data and is based on the empirical observation that a user-initiated Web access usually occurs either upon the user completing a keystroke (e.g., pressing the “enter” button) or clicking a hyperlink (e.g., releasing the left-mouse button). The user-interface events on which this determination is made, however, may vary from implementation to implementation to account for additional or other user-interface events. If the entry is determined to be a “key-down” or “mouse-up” event, then at process block 608, the “may-be-primary-URL” flag is set to “true” and the method continues to process block 610. Otherwise, the method proceeds directly to process block 610.
At process block 610, a determination is made as to whether the selected entry is a Web-access request. This determination can be made, for example, by recognizing the selected entry as a URL address. If the selected entry is not a Web-access request, then the method proceeds to process block 622, where a determination is made as to whether the selected entry is the last entry. If the selected entry is a Web-access request, then at process block 612 a determination is made as to whether the value of the “may-be-primary-URL” flag is “true.” If the flag is set to “false,” then at process block 614, the selected entry is added to a list of known secondary URLs (such as the one described above with respect to
If process block 612 determines that the flag is set to “true,” however, then a comparison is made at process block 616 to determine whether the selected entry is found in: (1) the list of known secondary URLs; or (2) a list of non-primary URL-address types (as described above with respect to
At process block 622, a determination is made as to whether the selected entry was the last entry. If the selected entry is not the last entry, then the method restarts with the next entry at process block 604; if it is the last entry, then the user-initiated Web-access requests are output at process block 624. The user-initiated Web-access requests may be output as part of a list of targeted user activities (such as the list created at process block 356 of
For purposes of this example, the analysis of the user-activity data in table 200 begins with entry 201. A determination is made that the entry is not a “key-down” or “mouse-up” event directed to a browser (process block 606), or a Web-access request (process block 610). Accordingly, because entry 201 is not the last entry (process block 622), the method 600 is repeated for entry 202.
Entry 202 is a “mouse-up” event (process block 606). Accordingly, the “may-be-primary-URL” flag is set (process block 608). Because the entry 202 is not a Web-access request, the method 600 continues with entry 203 (process blocks 610, 622).
Entry 203 is not a “key-down” or “mouse-up” event directed to a browser (process block 606), but is identified as a Web-access request (process block 610). Further, because the “may-be-primary-URL” flag is set (process block 612), the entry is compared to the lists of known secondary URLs and known non-primary URL types (process block 616). Assume for purposes of this example that entry 203 is not found in either of the lists. Accordingly, the entry is designated as a “user-initiated Web-access request” (process block 618) and the “may-be-primary-URL” flag is reset to “false” (process block 620).
The next entry, entry 205, is also identified as a Web-access request (process block 610), but the “may-be-primary-URL” flag is identified as being set to “false” (process block 612). Accordingly, the entry 205 is deemed to be a secondary URL and is added to the list of known secondary URLs (process block 614). The next few entries, through entry 206, are similarly identified as being secondary URLs, and are all added to the list of known secondary URLs.
With entry 207, the entry is identified as a “mouse-up” event directed to a browser (process block 606). Accordingly, the “may-be-primary-URL” flag is set (process block 608).
Entry 208 is then recognized as a Web-access request (process block 610). Further, because the “may-be-primary-URL” flag is set (process block 612), the entry is compared to the lists of known secondary URLs and known non-primary types (process block 616). Assume for purposes of this example that entry 208 is not found in either list. Accordingly, the entry 208 is designated as a “user-initiated Web-access request” (process block 618) and the “may-be-primary-URL” flag is reset (process block 620).
Because the “may-be-primary-URL” flag is reset to false and there are no intervening “key-down” or “mouse-up” events directed to a browser, the next few entries are determined to be secondary URLs, which are added to the list of known secondary URLs. After entry 210 is analyzed using the method 600, the user-initiated Web-access requests are output (process block 624).
The heuristic described above does not necessarily need to operate in the sequence shown above, as certain described operations may in some cases be rearranged or performed concurrently. Moreover, the particular titles of the various lists and flags described above should not be construed as limiting, as they may change from implementation to implementation. Additionally, the heuristic can be modified in several respects to identify other types of user-initiated Web accesses. For example, the heuristic can be modified to account for the situation where a user visits a page by clicking on a hyperlink in an email or a word processing document.
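For illustration, the flag-based heuristic of method 600 might be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the `Event` record, its field names, and the function `find_user_initiated` are hypothetical, and the comments map each step to the process blocks described above. The handling of an entry that is found in one of the lists at process block 616 is an assumption (the sketch skips the entry and leaves the flag set).

```python
from dataclasses import dataclass


@dataclass
class Event:
    time: float               # time data for the event
    kind: str                 # e.g. "key-down", "mouse-up", "web-access", "window"
    url: str = ""             # populated only for "web-access" events
    to_browser: bool = False  # True if a UI event was directed to a Web browser


def find_user_initiated(events, known_secondary, is_non_primary_type):
    """Return the URLs deemed user-initiated; known_secondary (a set) is updated."""
    may_be_primary = False                                   # block 603
    user_initiated = []
    for ev in sorted(events, key=lambda e: e.time):          # blocks 602, 604
        if ev.kind in ("key-down", "mouse-up") and ev.to_browser:  # block 606
            may_be_primary = True                            # block 608
        if ev.kind != "web-access":                          # block 610
            continue
        if not may_be_primary:                               # block 612
            known_secondary.add(ev.url)                      # block 614
        elif ev.url in known_secondary or is_non_primary_type(ev.url):  # block 616
            continue  # assumption: entry is skipped and the flag stays set
        else:
            user_initiated.append(ev.url)                    # block 618
            may_be_primary = False                           # block 620
    return user_initiated                                    # block 624
```

Here `is_non_primary_type` is any callable that tests a URL against the list of non-primary URL-address types, and `known_secondary` is a mutable set that accumulates secondary URLs as they are observed.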
Another exemplary type of heuristic that can be used in the general method 300 shown in
At process block 702, user-activity data is received. The user-activity data typically includes network-access requests (e.g., Web accesses), user-interface events (e.g., keystroke and mouse actions), and the corresponding times at which they occurred. The user-activity data may also comprise user-activity data that has been previously analyzed by another heuristic (e.g., the targeted user-activity data from table 400 in
At process block 704, a network-access request is selected from the user-activity data and compared to known search-engine-query addresses. The network-access request may be selected because it has some recognized format or simply because it is the next network-access request to be considered from the user-activity data. The known search-engine-query addresses may be stored in a list that is compiled empirically and that may be periodically updated to account for newly discovered or released search-engine-query addresses. The search-engine-query addresses relate generally to network-access requests that are recognized by their form to comprise a search-engine query. For example, in the context of a search engine used to search the Web, the search-engine-query addresses correspond to URLs used by known search engines to execute search queries. The Google® search engine, for example, typically uses the URL “http://www.google.com/search?h1=en&ie=UTF-8&q=user+query,” where the words “user+query” in the URL correspond to the terms searched for (here, “user query”). The search terms contained in the URL may be ignored for purposes of matching the URL to the selected network-access request.
At process block 706, the user's query is identified based at least in part on the comparison from process block 704. In general, once the search-engine-query addresses are known, the location of the user query within the address can be identified such that the user query itself can be extracted. In certain embodiments, procedural or declarative code can be written that parses each of the URLs (using, for instance, a separate pattern for each search engine). Because the formats of these search engine URLs are often unpublished, it may be necessary to reverse engineer the format by issuing queries against each engine and observing how the HTTP GET and POST requests change as a result. This reverse engineering may be performed manually or automatically. At process block 708, the user's query is output (e.g., in a list of targeted user activities)
At process block 804, the next event is selected from the user-activity data. Though not necessary, the user-activity data may be sorted (e.g., chronologically using the time data).
At process block 806, a determination is made as to whether the selected entry is a Web-access request. This determination can be made, for example, by recognizing the selected entry as a URL address. If the selected entry is not a Web-access request, then a determination is made at process block 812 as to whether the selected event is the last event. If it is not, then the method returns to process block 804, where the next entry from the user-activity data is selected.
If the selected entry is a Web-access request, then at process block 808, a determination is made as to whether the selected entry is found in a list of known search-engine-query URLs (described above). If the entry is not found in the list of known search-engine-query URLs, then the selected entry is presumed to not be a query to a search engine, and the method proceeds to process block 812. If the entry is found in the list, then the user query is identified at process block 810 using the matching URL from the list (e.g., by parsing the query from the selected entry according to the pattern of the matching URL address).
At process block 812, a determination is made as to whether the selected entry is the last entry in the user-activity data received. If it is not, then the method 800 is repeated with the next entry at process block 804. If the selected entry is the last entry, then the user queries and search engines identified are output at process block 814. For example, the queries and the corresponding search engines may be added to a list of targeted user activities (such as the list created at process block 356 of
Beginning with entry 402 from table 400, it is determined that the entry is a Web-access request (process block 806). For purposes of this example, assume that the list of known search-engine-query URLs includes an entry for the Google® search engine (e.g., http://www.google.com/search?h1=en&ie=UTF-8&q= . . . ). Accordingly, the entry 402 is found in the list of known search-engine-query URLs (process block 808) and the user query is identified (process block 810). Specifically, the user query for “ken alibek,” “testimony,” “congress,” and “exporting biotechnology” is identified. Because entry 402 is not the last entry, the method 800 is repeated for entry 404.
For entry 404, the entry is not found to be in the list of known search-engine-query URLs (process block 808). Accordingly, no search engine query is identified. The search engine and user query are then output (process block 814).
The heuristic described above does not necessarily need to operate in the sequence shown above, as certain described operations may in some cases be rearranged or performed concurrently.
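As a hedged illustration of the search-query heuristic, the following Python sketch looks up each Web-access request in a small table of known search-engine-query URLs and extracts the query terms. The table layout, the names `KNOWN_QUERY_URLS` and `extract_queries`, and the choice to key the table on host and path are assumptions made for this sketch; only the Google® host, path, and “q” parameter mirror the example URL given above.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical table: (host, path) -> (search-engine name, query parameter).
KNOWN_QUERY_URLS = {
    ("www.google.com", "/search"): ("Google", "q"),
}


def extract_queries(urls):
    """Return (engine, query) pairs for requests that match a known query URL."""
    found = []
    for url in urls:                                    # blocks 804, 812
        parts = urlsplit(url)
        match = KNOWN_QUERY_URLS.get((parts.netloc.lower(), parts.path))  # block 808
        if match is None:
            continue                                    # presumed not a query
        engine, param = match
        values = parse_qs(parts.query).get(param)       # block 810
        if values:
            found.append((engine, values[0]))           # "+" is decoded to a space
    return found                                        # block 814
```

Other parameters in the URL are ignored when matching, consistent with the observation that the search terms need not take part in the comparison.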
Another exemplary type of heuristic that can be used in the general method 300 shown in
At process block 902, user-activity data is received. In this embodiment, it is assumed that the user-activity data received comprises raw, file-activity data as can be detected through a workstation's file- or operating-system sensor 106 (e.g., using system-wide hooking). In one exemplary form, each file-activity event detected identifies at least: (1) the time of the event; (2) the file that was accessed during the event; and (3) the process accessing the file. As used herein, the term “process” refers to an instance of a running program (e.g., an instance of a software application running on the user's workstation).
Typically, a simple file action (such as opening a new file) can generate dozens of low-level, file-activity events (such as accessing temporary or system files or repeatedly accessing the file during execution of the action). These irrelevant and spurious low-level, file-activity events are desirably filtered such that only acts indicative of what file action the user intended to do are listed.
At process block 904, file-activity events that are known to be irrelevant are removed from the file-activity data. For example, a file-activity event can be removed if it matches an entry in a list of exclusion patterns. An exclusion pattern can comprise any feature or trait of the file-activity event that identifies the event as being one that is related to an irrelevant file. In one implementation of the method 900, for example, any file-activity event related to a file stored in a temporary folder is desirably excluded. Thus, the exclusion patterns might comprise: “*\Temporary Internet Files\*,” “*\Documents and Settings\Temp\*,” or “*\Documents and Settings\ . . . \Application Data\*,” (where the “*” represents a wildcard character). Likewise, the exclusion pattern can be tailored to target and remove file-activity events related to a specific process that is deemed to be irrelevant. Thus, for example, any access of a file performed by an instance of Explorer® (the file-system indexing service used in Windows®) or Internet Explorer® might be removed from the raw file-activity data received.
At process block 906, the remaining file-activity events are clustered together into larger periods of activity. For example, file-activity events that involve the same process and file, and that occur within predetermined time intervals of one another, are combined into a cluster (referred to herein as a “process-file cluster”). In one exemplary implementation, for instance, file-activity events are aggregated into a common process-file cluster if each event indicates an access by the same process, to the same file, occurring within N seconds (e.g., five seconds) of the previous event in the cluster. The interval of time that may elapse between events to be clustered may vary from implementation to implementation and can be derived empirically, or from statistical analysis or simulation. Conceptually, each process-file cluster represents a single file action that occurred over a period of time. Thus, for the embodiment described above, a process-file cluster can be viewed as indicating that process P accessed file F at time T1, and continued to access the file F at least once every N seconds until the access at time T2, at which time it stopped accessing the file F for at least N seconds.
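The filtering and clustering steps (process blocks 904 and 906) can be sketched as follows. This is an illustrative sketch only: the `FileEvent` and `Cluster` records, the function `cluster_events`, and the treatment of a gap of exactly N seconds as still belonging to the cluster are assumptions, and only two of the exclusion patterns mentioned above are included.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class FileEvent:
    time: float    # seconds
    path: str      # file that was accessed
    process: str   # process accessing the file


@dataclass
class Cluster:
    process: str
    path: str
    times: list = field(default_factory=list)  # access times, in order


# Exclusion patterns for temporary folders ("*" is the wildcard character).
EXCLUDE = [r"*\Temporary Internet Files\*", r"*\Documents and Settings\Temp\*"]


def cluster_events(events, gap=5.0):
    """Drop excluded events, then group the rest into process-file clusters."""
    clusters = []
    latest = {}  # (process, path) -> most recent cluster for that pair
    for ev in sorted(events, key=lambda e: e.time):
        if any(fnmatchcase(ev.path, p) for p in EXCLUDE):     # block 904
            continue
        key = (ev.process, ev.path)
        cur = latest.get(key)
        if cur is not None and ev.time - cur.times[-1] <= gap:  # block 906
            cur.times.append(ev.time)
        else:
            cur = Cluster(ev.process, ev.path, [ev.time])
            latest[key] = cur
            clusters.append(cur)
    return clusters
```

Each returned cluster spans the interval from `times[0]` (T1) to `times[-1]` (T2) for one process and one file.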
At process block 908, the process-file clusters are analyzed relative to when the file that was accessed was created and/or last modified. More specifically, a time associated with the process-file cluster (e.g., the time of the first file-access event in the cluster) is compared to the creation and/or modification times of the file accessed. The creation and/or modification time of the file is typically stored by the operating system of the user workstation. For instance, for a workstation using the Windows® operating system, the operating system can be queried once a process-file cluster is created to determine what dates and times the operating system has stored as the “last modification” and “creation” times for the file accessed during the process-file cluster.
At process block 910, the process-file clusters are classified as representing different types of file actions based at least in part on the comparison performed at process block 908. For example, according to one implementation, for a selected process-file cluster, if the first event in the cluster occurred after a threshold time period from the last modification (referred to herein as the “modification-time threshold”), then the file action represented by the process-file cluster is classified as an “opening” action (that is, the cluster is deemed to represent the opening of a file). On the other hand, if the first event in the cluster occurred within the modification-time threshold, then the process-file cluster is classified as either representing a “creation” action or a “modification” action. Specifically, if the first event of the cluster occurred after a threshold time period from the creation of the associated file (referred to herein as the “creation-time threshold”), then the file action represented by the cluster is classified as a “modification” action; otherwise, the file activity is classified as a “creation” activity.
At process block 912, the file actions performed during the process-file clusters are output. For example, in certain embodiments, the file actions are included in a list of targeted file actions, which can be combined with other user activities into a single list of targeted user activities.
Returning to process block 1008, a determination is made as to whether the file-activity event occurred within a specified period of time of the previous file-activity event or cluster identified as having the same process and file (measured, for example, from the last event in the cluster). The period of time used at process block 1008 may vary from implementation to implementation, but in one exemplary implementation is five seconds. If the file-activity event did occur within the specified period of time, then, at process block 1010, the event is combined with the previous file-activity event or cluster. That is, if the selected event occurred within the specified period of time of a previous matching event, then the two events are combined into a single process-file cluster; and if the selected event occurred within the specified period of time of a previous matching cluster, then the selected event is added to the cluster. The method 1000 then proceeds to process block 1012, where the cluster is associated with a window.
At process block 1014, a determination is made as to whether any events or process-file clusters are ready to be classified. In certain implementations, an event or cluster is ready for classification when it has been unchanged for a fixed period of time (e.g., five seconds). For example, when a cluster has had no new file-activity events added to it for a period of five seconds, it is deemed to be complete and ready for classification. Process block 1014 can be performed substantially continuously (e.g., at constant intervals) during execution of the method 1000. When an event or cluster is ready to be classified, the method 1000 proceeds to process block 1020 shown in
At process block 1020, the creation and modification times for the file associated with the event or process-file cluster to be classified are determined. This information is typically stored by the operating system of the user's workstation and can be obtained by querying the operating system. At process block 1022, a determination is made as to whether the event or the cluster occurred within a modification-time threshold (e.g., five seconds) of the modification time obtained at process block 1020. If so, then the method 1000 continues at process block 1026; otherwise, a determination is made as to whether the event or cluster is associated with a window (as was determined at process block 1012). If the event or cluster is associated with a window, then the file action represented by the event or cluster is designated as being an “opening” action (i.e., representative of the user opening a file). This classification, as well as other information concerning the file action, can then be output (e.g., in a list of targeted file actions or a list of targeted user activities) and the method can return to process block 1002 of
Returning to process block 1026, a determination is made as to whether the event or cluster occurred within a creation-time threshold (e.g., one minute) of the creation time obtained at process block 1020. If so, then the file action represented by the event or cluster is classified as a “creation” action (i.e., representative of the user creating a file); otherwise, if the event or cluster occurred after the creation-time threshold, then the file action is output as a “modification” action (i.e., representative of the user modifying a file).
The heuristic described above does not necessarily need to operate in the sequence shown above, as certain described operations may in some cases be rearranged or performed concurrently. Moreover, the particular titles of the various lists and flags described above should not be construed as limiting, as they may change from implementation to implementation. Additionally, the heuristic can be modified in several respects to identify other types of targeted file actions. For example, an “inclusion” list may be utilized to record events that would be classified as “open” events were they associated with a window title change.
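The classification logic of process blocks 1022 through 1026 might be expressed as the short function below. It is a sketch under assumptions: the function name `classify`, its parameter names, and the use of an absolute time difference (so that a modification time recorded slightly after the first event still counts as “within” the threshold) are illustrative choices rather than details confirmed by the description. The default thresholds follow the five-second and one-minute examples given above.

```python
from typing import Optional


def classify(first_event_time: float, created: float, modified: float,
             has_window: bool,
             modification_threshold: float = 5.0,
             creation_threshold: float = 60.0) -> Optional[str]:
    """Classify an event or process-file cluster; all times are in seconds.

    Returns "opening", "creation", "modification", or None when no file
    action is output for the event or cluster.
    """
    # Block 1022: did the access occur near the last-modification time?
    if abs(first_event_time - modified) > modification_threshold:
        # Not near a modification: an "opening" only if tied to a window.
        return "opening" if has_window else None
    # Block 1026: did the access occur near the creation time?
    if abs(first_event_time - created) <= creation_threshold:
        return "creation"
    return "modification"
```

A cluster that is neither near a modification time nor associated with a window yields `None`, corresponding to the case in which no file action is output.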
The first entry 1110 in the table 1100 occurred at 17:36:55.513 and was followed by numerous other accesses (represented by entry 1111 and the subsequent ellipses) until entry 1112 at 17:36:55.919. Then, as shown in entry 1120, the file was accessed again at 17:37:28.341, after which time numerous additional accesses to the file occurred (entry 1121 and the subsequent ellipses) until entry 1122 at 17:37:28.372. The next file access is shown in entry 1130 as occurring at 17:37:35.122, which was followed by numerous additional accesses (entry 1131 and the subsequent ellipses) until entry 1132 at 17:37:35.513. The additional file accesses that are represented by the ellipses are typically numerous in quantity and comprise a large amount of file-activity data that is desirably grouped together or ignored by the heuristic.
Beginning with the first file-activity event in entry 1110, it is determined that the event is not related to any excluded files (process block 1004) and does not involve the same file or process as any previous file-activity event because it is the first file-activity event (process block 1006). An evaluation is made as to whether the event is associated with a window (process block 1012). This evaluation can be performed, for example, by monitoring for any window title changes that occur within a fixed amount of time of the selected event (e.g., within two seconds). Assume for purposes of this example that the event is associated with a window title change. That is, assume that the name of the file accessed (“LoggingArchitecture7”) matches the name in a title bar of a window that was changed near the time of the selected event (e.g., within two seconds). A determination is made as to whether any events or clusters are ready to be classified (process block 1014). Assume for purposes of this example that events or clusters are to be classified if they have not been combined with any other events for more than five seconds. Thus, at this point in the example, no events or clusters are ready to be classified.
When the next entry 1111 is received, it is determined that the event is not related to any excluded files (process block 1004) and that the event involves the same file and process as a previous file-activity event or cluster, namely event 1110 (process block 1006). It is also determined that the event at entry 1111 occurred within a specified period of time of the previous event, which is assumed to be five seconds or less for purposes of this example (process block 1008). Thus, the event at entry 1111 is grouped into a common process-file cluster with entry 1110 (process block 1010). A window is already associated with the cluster (process block 1012), and no event or cluster is ready yet for classification (process block 1014).
The method 1000 continues to build the first process-file cluster until entry 1112. Five seconds after the first process-file cluster is complete, it is determined that the first cluster is ready for classification because it has not been combined with any other event for the specified period of time (process block 1014). Turning now to
The method 1000 continues in this manner for the next entries (entries 1120-1122) and builds a second process-file cluster. When the second cluster is ready to be classified (process block 1014), the operating system is again queried for the creation time and last modification time. Assume now that the creation time is unchanged, but that the last modification time is Aug. 13, 2004, 17:37:28.350. Thus, the first file-activity event in the second cluster (entry 1120) occurred within a modification-time threshold (process block 1022), but after a creation-time threshold (process block 1026). Thus, the second process-file cluster is classified and output as indicating the “modification” of the file (process block 1034).
For the next entries (entries 1130-1132), a third process-file cluster is built. Assume now that no window is associated with this third process-file cluster (that is, no window title change is found to have occurred within two seconds of any of the entries 1130-1132). When the third process-file cluster is ready for classification (process block 1014), the operating system is again queried for the creation and modification times of the file. Assume that the creation and modification times are unchanged from when they were queried for the second cluster. Thus, the first file-activity event in the third cluster did not occur within the modification-time threshold (process block 1022) and is not associated with a window (process block 1024). Consequently, no file action associated with the third process-file cluster is output. For example, the file may have been moved or deleted, events that the exemplary method 1000 does not record.
The heuristic described above does not necessarily need to operate in the sequence shown above, as certain described operations may in some cases be rearranged or performed concurrently.
Another exemplary type of heuristic that can be used in the general method 300 shown in
At process block 1302, the network responses to the network-access requests from a user's workstation are monitored. At process block 1304, a network response directing a window title change is identified. For example, the network responses being monitored can be searched to determine whether they contain any directives to change a window title on the user's workstation. In the context of monitoring a user's Web browser activity, for instance, the directive might comprise an HTML field that prompts a window title change in the user's Web browser (e.g., “<title> . . . </title>”).
At process block 1306, a window having a title that changed within a selected period of time of the network response directing the window title change is identified. For example, the data being received by an operating-system sensor (e.g., using system-wide hooking) can be monitored to see if a window title on the user's workstation changed within a selected period of time from receipt of the identified network-response (e.g., two seconds).
At process block 1308, the title of the window identified is compared to the title directed by the identified network-response. If the titles match, then the window is associated with the network-access request that produced the identified network response.
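A minimal Python sketch of process blocks 1304 through 1308 is given below. The function names `directed_title` and `associate`, the tuple layout of the window-title-change records, and the exact-match comparison of titles are assumptions made for illustration; a practical implementation might need to tolerate text that a browser appends to the window title.

```python
import re
from html import unescape

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def directed_title(response_html):
    """Return the title directed by an HTML response, or None (block 1304)."""
    m = TITLE_RE.search(response_html)
    return unescape(m.group(1)).strip() if m else None


def associate(response_time, request_url, response_html, title_changes,
              period=2.0):
    """Associate a window with the request that produced the response.

    title_changes is an iterable of (time, window_id, new_title) tuples.
    Returns (window_id, request_url) on a match, otherwise None.
    """
    title = directed_title(response_html)
    if title is None:
        return None
    for t, window_id, new_title in title_changes:
        # Block 1306: title change within the selected period of the response.
        if abs(t - response_time) > period:
            continue
        # Block 1308: compare the window title with the directed title.
        if new_title.strip() == title:
            return (window_id, request_url)
    return None
```

The returned pair can then be stored so that later user-activity data referring to the window can be tied back to the URL address.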
In one particular embodiment, after this association is made, the user can point to an active window on their workstation and have the network-access request (e.g., the URL address) associated with the window be included as part of any user-activity data recorded. The user may additionally be able to insert additional comments concerning the network-access request, which also becomes part of the user-activity data recorded. For instance, an operating-system sensor can be used to monitor a user's pointer-device (e.g., mouse) coordinates on a screen and to identify the window to which the user is pointing. The network-access request associated with this window can then be displayed to the user or recorded as part of the user-activity data, and, in some embodiments, the user can enter additional information about their activities related to the associated network-access request. As part of this feature, for instance, the user can point to a window and select to make a note about the contents in the window. Because the window can be associated with a particular network-access request using the general method 1300, the user's commentary can be associated not just with a particular window and window title, but with a particular network-access request (e.g., a URL address).
The first entry in the table 1500 corresponds to a window title change (and is indicative of a search being performed on the Google® search engine for the terms: “cnn,” “rice,” “commission,” and “testimony”). As each of the next few network-access requests is made, the network response thereto is monitored (process block 1402) and checked to determine whether it includes a title-change directive (e.g., “<title> . . . </title>”) (process block 1404). For purposes of this example, assume that the Internet response to entry 1510 has HTML with the following title directive: “<title>CNN.com—Rice delivers tough defense of administration—Apr. 8, 2004</title>.” Thus, when the Internet response to entry 1510 is received (process block 1402), a title-change directive is found (process block 1404), and any window-change events within a specified period of time (e.g., two seconds) are found (process block 1406). In this example, two window-change events occurred within the specified period of time: the change to “http://www.cnn.com/2004/ALLPOLITICS/04/08/911.commission/” at entry 1512 and the change to “CNN.com—Rice delivers tough defense of administration—Apr. 8, 2004” at entry 1514. (In this case, the window title first changed to the URL address being accessed as part of the standard operation of the Web browser, not as a result of a title directive.) The two titles are evaluated to determine whether they match the title in the title directive (process block 1408). Consequently, the window title change at entry 1514 is matched to the title directive (“CNN.com—Rice delivers tough defense of administration—Apr. 8, 2004”) and is associated with the URL address from entry 1510, which prompted the title change directive.
In the exemplary implementation illustrated in
The heuristic described above does not necessarily need to operate in the sequence shown above, as certain described operations may in some cases be rearranged or performed concurrently.
Any of the aspects of the technology described above may be performed on a single computer workstation or using a distributed computer network. An example of a distributed computer network according to one embodiment is shown in
Having illustrated and described the principles of the illustrated embodiments, it will be apparent to those skilled in the art that the embodiments can be modified in arrangement and detail without departing from such principles. Those skilled in the art will recognize that the disclosed embodiments can be easily modified to accommodate different situations and applications.
In view of the many possible embodiments, it will be recognized that the illustrated embodiments include only examples and should not be taken as a limitation on the scope of the disclosed technology. Rather, the disclosed technology comprises all novel and non-obvious features and aspects of the various disclosed embodiments and their equivalents, alone and in various combinations and sub-combinations with one another.
This application claims the benefit of U.S. Provisional Application No. 60/571,001, filed May 14, 2004, which is incorporated herein by reference.
This invention was made with Government support under a contract awarded by an agency of the United States Government. The Government has certain rights in the invention.
Number | Date | Country
---|---|---
60571001 | May 2004 | US