The present disclosure generally relates to data analysis and visualization, and in particular to generating a heat map based on temporal information and user clusters.
A heat map is a graphical representation of data where the values at any given intersection or data point on a two-dimensional graph are represented as colors or other graphical symbols. A heat map may be used as an outlier-detection visualization tool that can be applied to each specified unit for a large number of selected tags across many different time points. A heat map illustrates the anomaly-intensity and the direction of a ‘target observation.’ A heat map may also contain a visual illustration of alerts and direct immediate attention to hot-spot sensor values.
Business intelligence (BI) is a business management term that refers to applications and technologies that are used to gather, provide access to, and analyze data and information about business operations. Business intelligence systems can help companies obtain more comprehensive knowledge of the factors affecting their business, such as metrics on sales, production, and internal operations, and make better business decisions.
The present invention provides methods, apparatuses and systems directed to generating heat maps that facilitate analysis of user activity. In particular embodiments, a heat map represents activity intensity of time-based cohort groups over time. These and other features, aspects, and advantages of the disclosure are described in more detail below in the detailed description and in conjunction with the following figures.
The invention is now described in detail with reference to a few embodiments thereof as illustrated in the accompanying drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It is apparent, however, to one skilled in the art, that the present disclosure may be practiced without some or all of these specific details. In other instances, well known process steps and/or structures have not been described in detail in order not to unnecessarily obscure the present disclosure. In addition, while the disclosure is described in conjunction with the particular embodiments, it should be understood that this description is not intended to limit the disclosure to the described embodiments. To the contrary, the description is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the disclosure as defined by the appended claims.
Business intelligence (BI) is a business management term that refers to applications and technologies that are used to gather, provide access to, and analyze data and information about business operations. Business intelligence systems can help companies have a more comprehensive knowledge of the factors affecting their business (such as metrics on sales, production, and internal operations), spot trends, and make better business decisions. Business intelligence applications and technologies can enable organizations to make more informed business decisions, and provide a competitive advantage. For example, a company could use business intelligence applications or technologies to extrapolate information from indicators in the external environment and forecast the future trends in their sector. Business intelligence is used to improve the timeliness and quality of information and enable managers to better understand the position of their company in comparison to its competitors. Business intelligence applications and technologies can help companies analyze the following: changing trends in market share, changes in customer behavior and spending patterns, customers' preferences, company capabilities and market conditions. Business intelligence can be used to help analysts and managers determine which adjustments are most likely to affect trends.
Data visualization may be an aspect of business intelligence applications. Data visualization generally refers to the visual representation of data or information which has been abstracted in some schematic form, including attributes or variables for units of information. A heat map is one data visualization technique. It is a graphical representation of data where the values at any given point (represented, for example, as x- and y-coordinates) in a two-dimensional or three-dimensional surface are represented as colors, gray scale or other intensity values. In other words, the value at each point maps to a corresponding color, gray scale or other graphical encoding value (e.g., from black to blue to green to red to yellow and to white). The graphical encodings or indications provided by the different pixel color intensities and the overall visual representation of the data allow for assessments of various data along multiple axes. A heat map is highly useful for monitoring and diagnostics. A heat map illustrates the anomaly-intensity and the direction of a ‘target observation.’ Heat maps can also provide marketing opportunities on the fly with great accuracy across different time scales such as per second, minute, hour, day, and the like. The method, as embodied by the present invention, is particularly useful when applied to functional time-based cohorts.
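By way of illustration only, the mapping from a value to a graphical encoding can be sketched as a piecewise-linear color ramp. The sketch below uses the example color sequence given above (black, blue, green, red, yellow, white); the function name value_to_color, the evenly spaced color stops, and the default value range are assumptions made for this example and are not taken from the disclosure.

```python
# Color stops follow the example ramp in the text:
# black -> blue -> green -> red -> yellow -> white.
# Even spacing of the stops is an assumption for illustration.
COLOR_STOPS = [
    (0, 0, 0),        # black
    (0, 0, 255),      # blue
    (0, 255, 0),      # green
    (255, 0, 0),      # red
    (255, 255, 0),    # yellow
    (255, 255, 255),  # white
]


def value_to_color(value, vmin=0.0, vmax=1.0):
    """Map a scalar to an RGB triple by linear interpolation between stops."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    # Clamp the value into [vmin, vmax] and normalize to [0, 1].
    t = min(max((value - vmin) / (vmax - vmin), 0.0), 1.0)
    # Locate the pair of adjacent stops that brackets the value.
    scaled = t * (len(COLOR_STOPS) - 1)
    i = min(int(scaled), len(COLOR_STOPS) - 2)
    frac = scaled - i
    lo, hi = COLOR_STOPS[i], COLOR_STOPS[i + 1]
    return tuple(round(a + (b - a) * frac) for a, b in zip(lo, hi))
```

Values outside the stated range are clamped to the end colors, so a rendering loop can call the function on every cell without range checks.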
To facilitate a view of temporal information for purposes of trend evaluation, a heat map can be generated in which a first axis corresponds to a grouping or cluster of users, and a second axis corresponds to units of time. In statistics and demography, a cohort is a group of subjects that share one or more attributes in common, such as age, location, income level and the like. Cohorts may be tracked over periods of time in order to reveal trends and other aggregate behaviors. The graphical encoding at each point in the heat map may indicate a ratio or percentage of the users in each cluster that satisfy a set of criteria. The set of attributes that are used to define each cluster may also be time-based, such as the day an event associated with a user occurred (e.g., the date of first registration, the date a user first clicks on a given web page or ad, etc.). In this way, a viewer can monitor activities and trends between and across these cohort groups without using multiple two-dimensional graphs.
To analyze trends in some metric over time (such as the percentage of users from a given cohort group that are still active after a certain number of days since first registration), a graph with a function of one variable (see
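As a minimal sketch of the data behind such a heat map, the following example computes, for each cohort group and each day of tenure, the fraction of users in the group that were active on that day. The input layout (a cohort label paired with a collection of day offsets since the cohort-defining event) and the name cohort_activity_ratios are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def cohort_activity_ratios(users, num_days):
    """Return {cohort: [ratio for tenure day 0 .. num_days - 1]}.

    users is an iterable of (cohort, active_offsets) pairs, where
    active_offsets holds the day offsets (since the cohort-defining
    event) on which the user satisfied the activity criteria.
    """
    totals = defaultdict(int)                      # users per cohort
    hits = defaultdict(lambda: [0] * num_days)     # active users per cell
    for cohort, active_offsets in users:
        totals[cohort] += 1
        # set() guards against counting one user twice for the same day.
        for offset in set(active_offsets):
            if 0 <= offset < num_days:
                hits[cohort][offset] += 1
    return {c: [h / totals[c] for h in hits[c]] for c in totals}
```

Each list in the result corresponds to one column (or row) of the heat map, with one entry per unit of time.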
In
The line graph 304 under the heat map is a time series of the number of users in each cohort group—in one implementation, the number of users who confirmed their accounts on each day. The line graph 304 provides context for the volume of users in each cohort group. As discussed above, the triangle-shape of the plot 316 results because users in more recent cohort groups (increasing values of x) 320 have not been on the site long enough to provide data as y values 322 increase beyond the total number of days since a given cohort group first registered with the web site. In addition, the same calendar day for each cohort group along the y-axis is shifted by one day. Accordingly, the state of all cohort groups on a given day can be assessed by running a diagonal line 314.
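The relationship between cohort group, tenure, and calendar day described above can be sketched as simple index arithmetic. The example assumes one cohort group per day and that cohort indices and calendar days are counted from the same day zero, so that a cell at cohort x and tenure y falls on calendar day x + y; the function names are hypothetical.

```python
def has_data(cohort, tenure, last_observed_day):
    """A cell holds data only if its calendar day has already occurred.

    This condition is what produces the triangular shape of the plot.
    """
    return cohort + tenure <= last_observed_day


def cells_on_calendar_day(day, num_cohorts):
    """Return the (cohort, tenure) cells lying on the diagonal for one day."""
    return [(x, day - x) for x in range(num_cohorts) if day - x >= 0]
```

Reading the heat map values at the cells returned by cells_on_calendar_day gives the state of every cohort group on that calendar day.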
The heat map of
Vertical color patterns 310 represent attributes associated with a particular cohort group. For example, the heat map reveals that the cohort groups corresponding to December 2007 to approximately April 2008 exhibited roughly similar activity patterns, while subsequent cohort groups behaved differently and remained more active. From the heat map, a user may also be able to discern diagonal color patterns or lines (upper-left to lower-right) 314 that represent a particular calendar date. For example, diagonally oriented lines or patterns can reveal events or trends that are observed across different cohort groups independent of tenure. Where such a diagonal line intersects the x-axis in the example heat map illustrated in
To generate the graph above
The graph generating process may also join the registration table of
The graph generating process may, for each user and cohort group, perform a stepwise scan of the bit arrays for each entry to identify whether a user satisfies an “Active 30” condition (456). As discussed above, an Active 30 user is a user that, relative to day X, was active at least one day in the 30 days preceding day X. Accordingly, to detect whether a user satisfies this condition for a series of days, the graph generating process may use a 30-day (that is, 30-bit) window. If there is at least one “1” value in the current window, the “Active 30” condition is satisfied for that day. The graph generating process may increment a counter value for that day and then advance the scan window by one bit position and repeat the evaluation until the end of the bit array is reached. As discussed above, this process is performed for all users and cohort groups. The graph generating process uses the resulting values to generate a visual representation of the heat map, such as that illustrated in
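The stepwise scan described above can be sketched as follows. This is an illustrative sketch rather than the disclosed implementation: it assumes each user's activity is a list of 0/1 values indexed by day since registration, and that the window for a given day ends at and includes that day (the text leaves open whether day X itself is counted). The names active_n_flags and cohort_active_counts are hypothetical.

```python
def active_n_flags(bits, window=30):
    """Return per-day flags: 1 if any bit in the trailing window is set.

    For day i the window covers bits[max(0, i - window + 1) : i + 1].
    Including day i itself is an assumption of this sketch.
    """
    flags = []
    in_window = 0  # number of set bits currently inside the window
    for i, bit in enumerate(bits):
        in_window += bit
        if i >= window:
            # Advance the window by one position: drop the oldest bit.
            in_window -= bits[i - window]
        flags.append(1 if in_window > 0 else 0)
    return flags


def cohort_active_counts(bit_arrays, window=30):
    """Sum the per-day flags over all users of one cohort group."""
    if not bit_arrays:
        return []
    counts = [0] * max(len(bits) for bits in bit_arrays)
    for bits in bit_arrays:
        for day, flag in enumerate(active_n_flags(bits, window)):
            counts[day] += flag  # increment the counter for that day
    return counts
```

Dividing each count by the number of users in the cohort group yields the ratio plotted at the corresponding point. Keeping a running count of set bits avoids rescanning the whole window at every step.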
The implementation described above illustrates how cohort groups are based on dates of first registration and an evaluation of user activity against an “Active 30” condition. The invention has application to a wide variety of analysis scenarios. For example, cohort groups may be defined by other time-based criteria and events. For example, the time-based criterion can be the date of any activity or event associated with a user, such as the day a user was first presented with (or first clicked on a URL corresponding to) an advertisement (or advertising campaign), the date a user first expressed interest in a given section of a web site or a particular page, the date a user first made a purchase in a physical retail or web-based store, the date a user first utilized a new feature of a web site, the date a user first opted in to a service or promotion, and the like.
Furthermore, the evaluation of user activity can also vary considerably. For example, the user activity can be evaluated against an “Active 15”, “Active 7” or “Daily Active” basis. Furthermore, the activities assessed can be generally defined as any activity associated with a web site or other entity, or specific activities (such as use of particular features, access of particular web pages, purchase activity and the like). Furthermore, the activity values at each intersection can also vary. In the implementation discussed above, each intersection point corresponds to a ratio or percentage of active users in a given cohort group. In other implementations, other types of activity can be quantified. For example, the values at each intersection point may represent the aggregate number of page views, the aggregate data bytes transferred, aggregate purchase amount activity and the like.
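As one illustration of quantifying a different type of activity, the sketch below sums an arbitrary amount (for example page views, bytes transferred, or purchase amount) for each intersection of cohort group and tenure day. The event tuple layout and the name aggregate_metric are assumptions made for this example.

```python
from collections import defaultdict


def aggregate_metric(events):
    """Sum amounts per heat map cell.

    events is an iterable of (cohort, tenure_day, amount) tuples.
    Returns {(cohort, tenure_day): total amount}.
    """
    totals = defaultdict(float)
    for cohort, tenure_day, amount in events:
        totals[(cohort, tenure_day)] += amount
    return dict(totals)
```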
As described herein, the heat map-generating process can be implemented as a series of computer-readable instructions, embodied on a data storage medium, that when executed are operable to cause one or more processors to implement the operations described above. For smaller data sets, the operations described above can be executed on a single computing platform or node. For larger systems and resulting data sets, parallel computing platforms can be used. For example, the operations discussed above can be implemented using Hive to accomplish ad hoc querying, summarization and data analysis, as well as to incorporate statistical modules by embedding mapper and reducer scripts, such as Python or Perl scripts that implement a statistical algorithm. For example, Fisher's exact test or another statistical algorithm can be implemented as a Python script, which as shown above can be called using a TRANSFORM clause. Other development platforms that can leverage Hadoop or other Map-Reduce execution engines can be used as well.
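A script invoked through a Hive TRANSFORM clause receives rows on standard input as tab-separated fields, one row per line, and writes tab-separated rows to standard output. The following sketch shows that shape for a simple trailing-window activity flag rather than for Fisher's exact test. The input columns (a user identifier and a string of 0/1 characters), the script name, the table name in the comment, and the choice to include the current day in the window are all assumptions for illustration.

```python
import sys

# Hypothetical invocation from Hive (script, table and column names are
# illustrative only):
#   SELECT TRANSFORM (user_id, bits)
#   USING 'python active_flags.py'
#   AS (user_id, day, flag)
#   FROM user_activity;


def transform(lines, window=30):
    """For each 'user_id<TAB>bits' row, yield 'user_id<TAB>day<TAB>flag' rows."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        user_id, bits = line.split("\t")
        in_window = 0  # set bits inside the trailing window
        for day, ch in enumerate(bits):
            in_window += ch == "1"
            if day >= window:
                in_window -= bits[day - window] == "1"
            yield f"{user_id}\t{day}\t{int(in_window > 0)}"


# Stream rows only when input is piped in, as it is when Hive runs the script.
if __name__ == "__main__" and not sys.stdin.isatty():
    for row in transform(sys.stdin):
        print(row)
```

Keeping the row logic in a generator separate from the standard input loop makes the script testable outside of Hive.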
The Apache Software Foundation has developed a collection of programs called Hadoop (named after a toddler's stuffed elephant), which includes: (a) a distributed file system; and (b) an application programming interface (API) and corresponding implementation of MapReduce.
Multiple nodes also facilitate the parallel processing of large databases. In some embodiments of the present invention, a master server, such as 522a, receives a job from a client and then assigns tasks resulting from that job to slave servers or nodes, such as servers 522b, which do the actual work of executing the assigned tasks upon instruction from the master and which move data between tasks. In some embodiments, the client jobs will invoke Hadoop's MapReduce functionality, as discussed above.
Likewise, in some embodiments of the present invention, a master server, such as server 522a, governs a distributed file system that supports parallel processing of large databases. In particular, the master server 522a manages the file system's namespace and block mapping to nodes, as well as client access to files, which are actually stored on slave servers or nodes, such as servers 522b. In turn, in some embodiments, the slave servers do the actual work of executing read and write requests from clients and perform block creation, deletion, and replication upon instruction from the master server.
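The division of a job into map and reduce tasks can be sketched in a single process as shown below. The sketch does not use the Hadoop API and does not model the master and slave servers; it only illustrates the map, shuffle, and reduce phases that such a job would distribute across nodes. The record layout (a cohort label paired with a list of per-day 0/1 flags) and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit ((cohort, day), flag) pairs for one user record."""
    cohort, flags = record
    for day, flag in enumerate(flags):
        yield (cohort, day), flag


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the flags for one (cohort, day) cell."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce sequentially over all records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each map call depends on a single record and each reduce call on a single key, both phases can be split across nodes without coordination beyond the shuffle.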
While the foregoing processes and mechanisms can be implemented by a wide variety of physical systems and in a wide variety of network and computing environments, the server or computing systems described below provide example computing system architectures for didactic, rather than limiting, purposes.
The elements of hardware system 600 are described in greater detail below. In particular, network interface 616 provides communication between hardware system 600 and any of a wide range of networks, such as an Ethernet (e.g., IEEE 802.3) network, a backplane, etc. Mass storage 618 provides permanent storage for the data and programming instructions to perform the above-described functions implemented in the servers 522a, 522b, whereas system memory 614 (e.g., DRAM) provides temporary storage for the data and programming instructions when executed by processor 602. I/O ports 620 are one or more serial and/or parallel communication ports that provide communication between additional peripheral devices, which may be coupled to hardware system 600.
Hardware system 600 may include a variety of system architectures, and various components of hardware system 600 may be rearranged. For example, cache 604 may be on-chip with processor 602. Alternatively, cache 604 and processor 602 may be packed together as a “processor module,” with processor 602 being referred to as the “processor core.” Furthermore, certain embodiments of the present invention may neither require nor include all of the above components. For example, the peripheral devices shown coupled to standard I/O bus 608 may couple to high performance I/O bus 606. In addition, in some embodiments, only a single bus may exist, with the components of hardware system 600 being coupled to the single bus. Furthermore, hardware system 600 may include additional components, such as additional processors, storage devices, or memories.
In one implementation, the operations of the heat map generating process described herein are implemented as a series of executable modules run by hardware system 600, individually or collectively in a distributed computing environment. In a particular embodiment, a set of software modules and/or drivers implements a network communications protocol stack, parallel computing functions, heat map generating processes, and the like. The foregoing functional modules may be realized by hardware, executable modules stored on a computer readable medium, or a combination of both. For example, the functional modules may comprise a plurality or series of instructions to be executed by a processor in a hardware system, such as processor 602. Initially, the series of instructions may be stored on a storage device, such as mass storage 618. However, the series of instructions can be stored on any suitable storage medium, such as a diskette, CD-ROM, ROM, EEPROM, etc. Furthermore, the series of instructions need not be stored locally, and could be received from a remote storage device, such as a server on a network, via network/communications interface 616. The instructions are copied from the storage device, such as mass storage 618, into memory 614 and then accessed and executed by processor 602.
An operating system manages and controls the operation of hardware system 600, including the input and output of data to and from software applications (not shown). The operating system provides an interface between the software applications being executed on the system and the hardware components of the system. Any suitable operating system may be used, such as the LINUX Operating System, the Apple Macintosh Operating System, available from Apple Computer Inc. of Cupertino, Calif., UNIX operating systems, Microsoft® Windows® operating systems, BSD operating systems, and the like. Of course, other implementations are possible. For example, the heat map generating functions described herein may be implemented in firmware or on an application specific integrated circuit.
Furthermore, the above-described elements and operations can be comprised of instructions that are stored on storage media. The instructions can be retrieved and executed by a processing system. Some examples of instructions are software, program code, and firmware. Some examples of storage media are memory devices, tape, disks, integrated circuits, and servers. The instructions are operational when executed by the processing system to direct the processing system to operate in accord with the invention. The term “processing system” refers to a single processing device or a group of inter-operational processing devices. Some examples of processing devices are integrated circuits and logic circuitry. Those skilled in the art are familiar with instructions, computers, and storage media.
The present invention has been explained with reference to specific embodiments. For example, while embodiments of the present invention have been described as operating in connection with a social network system, the present invention can be used in connection with any communications facility that allows for communication of messages between users, such as an email hosting site. In addition, while some embodiments have been described as analyzing wall posts, other message channel types, such as email, can also be considered in addition to, or in lieu of, wall posts. Still further, the heat map generating process described above can be made accessible to external systems via a set of application programming interfaces. Other embodiments will be evident to those of ordinary skill in the art. It is therefore not intended that the present invention be limited, except as indicated by the appended claims.