Aspects of the present disclosure relate to web site performance and web transactional data collection, cleansing, aggregation, and analysis to generate business and operational intelligence through real-time analytics.
E-commerce providers hosting web sites or providing services for web merchants, and web merchants themselves, are interested in finding new ways to attract and keep online customers and to protect their systems from data breaches and other issues. Intelligence related to web site traffic and customer behavior on a web site can provide key insights into a customer's preferences, show how application performance affects customer behavior, and provide early indication of issues that may drive low conversion rates, poor web site health, or possible fraud. Reporting on data collected during an online user experience is typically time delayed, sometimes making the knowledge that can be gleaned from the data outdated by the time a client receives it.
A real-time data feed allows a web merchant to monitor the health of the web site, to monitor flash sales and extensive A/B tests, and to use real-time data internally for inventory and fulfillment. Real-user monitoring performed on web sites provides key information regarding the health of a web site. A real-time data feed allows the web site administrator to discover and address problems as they manifest on the site and to take corrective action to minimize cart or web site abandonment, avoid losses due to fraud, prevent application and operational issues, prevent compliance violations, and optimize web site content and offers.
The system and method disclosed herein give actionable business and operational intelligence to the client so that the client can optimize its customers' buying experience and put hard numbers around the changes it makes. The overall combination of real user monitoring, cart creation and visit details, and payment processing details allows clients to track over time how changes affect not only sales but the entire shopping experience.
By monitoring close rates and page performance over time, web platforms can identify where improvements can be made and, more importantly, attach metrics and numbers to the changes they do make, so they can verify and validate their effectiveness. For payment processing systems, real-time analytics allows risk and compliance teams to highlight and investigate areas with possible issues before losses or data issues occur.
Systems and methods providing real-time web analytics are disclosed. One embodiment features a data source or client, data processing and analytics devices and workflows, and a data science system. Embodiments of the disclosed system and method provide web and other event-based analytics in real time. A client may receive a request for an event initiated by a user and publish it to the analytics processing platform. The client may append additional data to the message and transform it into a JSON format prior to publishing the request on a message bus. Raw messages are captured in a real-time data message processing queue, scrubbed based on source data requirements, and republished to topic queues in a message bus for further consumption.
The message is extracted from the queue and written to a message database, creating a document record for the message. This raw message data is available for immediate viewing and analysis. Aggregate processing programs copy the message and aggregate the new message with existing message records. Data metrics programs are run on the newly aggregated data and the results are written to an aggregated data database. Comma Separated Value (.csv) files are created with the updated aggregated data and loaded into a reporting database with a graphical user interface that presents counts, statistics, and graphical representations to interested clients. The system uses components that are optimized for use with large amounts of streaming data over a highly distributed environment and provides results to the client within real-time parameters.
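The flow described above, in which a client appends metadata to an event and serializes it to JSON before publication on the message bus, can be sketched as follows. The envelope field names (`client_id`, `event_type`, `received_at`) are illustrative assumptions, not taken from the disclosure.

```python
import json
import time

def build_event_message(event_type, payload, client_id):
    """Wrap a raw event in a message envelope, append metadata, and
    serialize to JSON for publication on the message bus."""
    message = {
        "client_id": client_id,        # assumed identifier for the data source client
        "event_type": event_type,      # e.g. a page view or payment request
        "received_at": time.time(),    # appended timestamp
        "payload": payload,            # the raw event data from the user
    }
    return json.dumps(message)

# A page-view event as it might be published to a raw-message topic.
msg = build_event_message("page_view", {"page": "/checkout"}, client_id="web-42")
record = json.loads(msg)
```

Downstream consumers (data quality filters, aggregation jobs) would parse the JSON string back into a document for processing.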
The system components described herein provide a highly flexible and scalable real-time data collection and analysis system providing actionable business and operational intelligence to ecommerce platforms.
Embodiments of the present invention may be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. The invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the disclosure may enable one of ordinary skill in the art to make and use the invention. Like numbers refer to like components or elements throughout the specification and drawings.
Embodiments of the invention are directed to systems and methods for providing real-time web and transaction analytics. According to the systems and methods of the present disclosure, a real-time web analytics system consumes data from a variety of data sources, processing the data through a plurality of applications that may be developed on top of Open Source technology such as Apache™ Kafka, Apache™ Hadoop, MongoDB, HDFS, Hive, Apache™ Spark, and others. These technologies provide an inexpensive, highly performant environment for streaming applications such as a Real-time Web Analytics System and Method.
In this disclosure, the term “client” refers to a source or consumer of the data processed by the disclosed system. A “user” refers to an individual, operating a computing device and initiating the type of events being consumed by the system. For example, a payment processing platform is a client; the individual making an online payment is a user. An ecommerce system hosting web pages is a client; the individual accessing the web pages is a user. “User” may be used synonymously with “customer.” A use case may be developed for each client defining their use of a particular embodiment. Input and output data, system configurations and data aggregation and metrics programs may be client specific.
Embodiments of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products. It may be understood that each block of the flowchart illustrations and/or block diagrams, and/or combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create mechanisms for implementing the functions or acts specified in the flowchart and/or block diagram block or blocks.
Computer program instructions may also be stored in a non-transitory computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including instruction means which implement the functions or acts specified in the flowchart and/or block diagram block(s).
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the flowchart and/or block diagram block(s). Alternatively, computer program implemented steps or acts may be combined with operator or human implemented steps or acts in order to carry out an embodiment of the invention.
As was mentioned above, clients 102 of a real-time web analytics system and method may generate data received by API, typically a REST API 104 where the client may be a payment processing system or ecommerce platform; created by log messages 106 generated from pixel tracking of a user's experience with a web site; or loaded into the system from a database 108, which may use an extract, transform and load tool 110. As a transaction or message is received, it is immediately published to the message bus 112.
Referring again to
Apache Kafka™ is an open source distributed streaming platform/message bus that is implemented in clusters consisting of one or more servers (i.e., Kafka brokers) running an instance of Kafka. ZooKeeper maintains metadata about the brokers, topics (queues) within the brokers, partitions within topics, clients, and other information required to run Kafka. Producers, or publishers, publish JSON messages to designated topics or queues, from which they are pulled by consumers. In a preferred embodiment of this disclosure, data source clients are producers, as are the data quality framework and any process that writes message data to be subsequently pulled by another process. Topics, or queues, are provided for raw messages and for data quality messages that have updated the raw message. Consumers pull messages using nextMessage, each consumer having been assigned a number of partitions on a particular queue. Consumers in a preferred embodiment include the data quality framework, ramps, and Flume, which pull messages using a nextMessage class from assigned partitions, giving the system its scalability.
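The partition mechanics that give the system its scalability can be illustrated with a small sketch. Kafka's default partitioner hashes the message key to pick a partition; CRC32 is used below only to keep the illustration deterministic, and the consumer names are hypothetical.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a topic partition, so all messages with the
    same key (e.g. a visitor ID) land on the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign_partitions(consumers, num_partitions):
    """Spread a topic's partitions across a group of consumers
    round-robin, mirroring how each consumer is assigned a number of
    partitions on a particular queue."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# Three hypothetical data-quality consumers sharing an 8-partition topic.
assignment = assign_partitions(["dq-1", "dq-2", "dq-3"], 8)
```

Because each consumer pulls only from its assigned partitions, adding consumers (up to the partition count) increases throughput without coordination between them.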
Data quality processing framework modules 114, comprising program code and stored in server memory, define input-output message parameters and filters for the message bus 112. Input-output parameters direct messages to a particular queue or storage location (or topic, in Kafka) so they are available for future consumption. Filters may enhance a message by providing rules regarding data to append to a certain type of message, data cleansing rules, and the like, and allow the system to grab subsets of data to publish back out. Filters may be stacked for serial application. A data quality module may include in-memory storage tables that hold auxiliary data, including lookup tables for data standardization and aggregation and resources such as currency conversion tables. When applying a filter, the data quality processing framework may access an in-memory database or additional modules not shown in
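Stacked, serial filter application as described above can be sketched as follows. Each filter takes the message string and hands its output to the next; the two example filters (country-code standardization and region enrichment from an in-memory lookup table) are illustrative assumptions.

```python
import json

def uppercase_country(message: str) -> str:
    # Standardization filter: normalize the country code.
    record = json.loads(message)
    if "country" in record:
        record["country"] = record["country"].upper()
    return json.dumps(record)

def append_region(message: str) -> str:
    # Enrichment filter: append a region from a lookup table
    # (the table contents here are illustrative).
    regions = {"US": "NA", "DE": "EU"}
    record = json.loads(message)
    record["region"] = regions.get(record.get("country"), "UNKNOWN")
    return json.dumps(record)

def apply_filters(message: str, filters) -> str:
    """Apply stacked filters serially: each filter receives the message
    string and returns a (possibly modified) string for the next."""
    for f in filters:
        message = f(message)
    return message

out = json.loads(apply_filters('{"country": "us"}',
                               [uppercase_country, append_region]))
```

Note that the enrichment filter only works as intended after the standardization filter has run, which is why filter ordering within the stack matters.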
Following processing through the data quality framework module 114, processed messages may be written back to a new queue in the message bus 116 and may be extracted from there by any system that can consume the data. In particular, message data may be extracted by a raw message long term storage data store 120. Raw messages may be extracted from the data store 120 as they come in and are processed by aggregation programs 122 that append the message to previously processed messages and recalculate the reporting statistics.
Illustrated in
Returning to
Raw message data in short term storage is processed through a series of data aggregation processes 122. Each message is extracted and aggregated with the previously processed messages and metrics may be calculated. Aggregated data may then be moved to an aggregated data store such as MongoDB AGG 128. Data stored in HDFS 118 may be processed through a data processing engine such as Apache Spark™ 126 and the resulting aggregated data and metrics may be written to the MongoDB AGG 128 as well.
Comma Separated Value (.csv) files 130 are created from the processed data in MongoDB AGG 128 and may be moved, using an ETL tool such as Informatica, to a relational database 132, where they may be accessed by web applications with a graphical user interface capable of displaying data statistics and graphics, for example, a home-grown business intelligence interface 134, Hyperion Essbase 136, or Oracle Business Intelligence Enterprise Edition (OBIEE) 138.
A data science system, consisting of tools or modules containing program code for calculating and displaying data for very large numbers of messages across many clusters of computers may also consume this data for added business intelligence. Tools such as Apache Spark 140 and Zeppelin 142 are exemplary tools that may be used for this purpose.
As was mentioned above, data can come from nearly any type of client or source 102, including API transactions from commerce, payment, or other transactional platforms 104, web user monitoring from a website hosting platform 106, and ETL transactions 110 from any database or file source 108. Real user monitoring (RUM) captures web traffic data and stores it in a message log storage tool. In one embodiment, beacon technology is used to collect user monitoring data using event-based tracking. A beacon may be programmed to collect data regarding a type of event, the site ID, the visitor ID, page type, date, first byte, page load and other measurements. The tracking program may be added to any web page.
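A beacon payload of the kind described above, carrying the event type, site ID, visitor ID, page type, and timing measurements, might be assembled as in this sketch. The field names and timing keys are illustrative assumptions rather than the disclosure's actual schema.

```python
def build_beacon(site_id, visitor_id, page_type, timings):
    """Assemble an event-based tracking beacon payload with the kinds of
    measurements named in the text: site ID, visitor ID, page type, and
    first-byte and page-load timings (milliseconds assumed)."""
    return {
        "site_id": site_id,
        "visitor_id": visitor_id,
        "page_type": page_type,
        "t_first_byte_ms": timings.get("first_byte"),
        "t_page_load_ms": timings.get("page_load"),
    }

# A hypothetical product-page view with its performance timings.
beacon = build_beacon("site-7", "v-123", "product",
                      {"first_byte": 120, "page_load": 1800})
```

In practice the tracking script would serialize such a payload and send it to the log message collection endpoint when the page-load event fires.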
An exemplary event-based web data collection process may use tools such as the open source product Boomerang or a similar tool. Referring to
Referring back to
Data quality rules are stored in a highly available in-memory database (such as Redis, a product of Redis Labs) in the data quality module, which may be accessed by database and key, and include lookup tables for data standardization and aggregation and for resources such as currency conversion tables. Two examples of rules that may be applied are (1) a list of rules used for stripping personal identifying information (PII) from a payment processing transaction and (2) currency conversion from or to USD, given the currency and date. These tables may be updated daily. In a preferred embodiment, data quality filters are written in Scala. A filter is a trait in Scala, similar to an interface and base class in Java. A filter implementation class implements a runFilter function which accepts a string as a parameter and returns a string. Base functionality handles reading and writing the strings from message queues. Multiple filters can be configured for a message stream, so many filters may be applied to a message read from the message bus before it is published back out. Filters are fault tolerant; if there is an issue, the message will not be lost. Traits (filters) are used to allow multiple ways to ingest or write data, including reading and writing to the Kafka message bus 112, 116. They use the nextMessage class and write as primary functions, so they can easily be adapted to other message buses or even databases.
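The PII-stripping rule described above, expressed against the string-in, string-out runFilter contract, might look like the following sketch (the disclosure's filters are Scala; this is a Python rendering, and the field names in the rule list are assumptions).

```python
import json

# Illustrative rule list, not the disclosure's actual PII rules.
PII_FIELDS = {"card_number", "cardholder_name", "cvv"}

def run_filter(message: str) -> str:
    """Accept the message as a string and return a string, mirroring the
    runFilter contract. Strips assumed PII fields from a payment
    processing transaction before the message is republished."""
    record = json.loads(message)
    for field in PII_FIELDS:
        record.pop(field, None)   # drop the field if present
    return json.dumps(record)

cleansed = json.loads(run_filter(
    '{"txn_id": "t-100", "amount": 12.50, "card_number": "4111111111111111"}'
))
```

Because the contract is just string to string, the same filter body can be fed from a Kafka partition, another message bus, or a database without modification.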
The data quality framework may provide any number of filters. They are defined and applied based on the type of data that is being collected and the requirements of the client. Table 1 below provides a list of exemplary filters that may be applied to the data source clients described herein. Table 2 provides an example of a geo-enrichment filter written in scala.
As was described above, embodiments of the real-time data analytics system and method may apply a data aggregation module 122 to the raw message/transaction data 120 in order to derive business intelligence 132-138 to monitor the performance of a system or the integrity of incoming transactions. A data aggregation module 122 comprises computer programs, stored in server memory, which when executed by the server processor perform various functions of aggregation and calculation on an incoming message. Data aggregation programs 122 run continuously to append each new, cleansed message to existing aggregated data. Metrics calculation programs create the statistics of interest by performing the desired calculations against the data that now includes the new message or messages. Metrics may be calculated for a time period (hour, day, week) for any piece of data collected from the data source, for example, client_id, site_id, locale, page type, user browser type, user operating system, device type, and more. Table 3 below provides some exemplary aggregation and metrics calculation programs that are provided by a preferred embodiment of the disclosed system and methods.
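Continuous aggregation of this kind, appending each new cleansed message to per-dimension aggregates and recomputing a metric, can be sketched as below. The grouping keys (client_id, page type) and the chosen metric (median page-load time) are illustrative, not a program from Table 3.

```python
from collections import defaultdict
from statistics import median

class RunningAggregator:
    """Append each new, cleansed message to aggregates keyed by
    (client_id, page_type) and recompute the metric on demand."""

    def __init__(self):
        self.samples = defaultdict(list)

    def add(self, message):
        # Aggregate the new message with previously processed messages.
        key = (message["client_id"], message["page_type"])
        self.samples[key].append(message["page_load_ms"])

    def metrics(self):
        # Recalculate the reporting statistic over all messages so far.
        return {key: median(vals) for key, vals in self.samples.items()}

agg = RunningAggregator()
for load in (900, 1100, 4000):
    agg.add({"client_id": "c1", "page_type": "checkout", "page_load_ms": load})
```

A production aggregation job would persist these results to the aggregated data store rather than hold them in process memory, but the append-then-recalculate loop is the same.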
Aggregated data and calculated metrics are stored in a database, such as MongoDB 128. As each new message flows through the system, creating new aggregated data and new metrics, database records are extracted and .csv files 130 are created from the extracted data. An ETL tool, such as Informatica, may be used to load these records into a relational reporting database 132. Data is presented to a user accessing a graphical user interface of a business intelligence system 138, such as Oracle's Business Intelligence system OBIEE, or other interface tools which can access the reporting database.
Transaction data may be optionally extracted from the primary data center message bus 314 and stored in HDFS 316 and Hive 318. The transaction message data is further consumed by MongoDB 320 for long term storage and further processing. The message data is extracted from the MongoDB message database 320 and processed through a number of Python aggregation jobs 322 which aggregate data and compute statistics, such as those described in Table 3, above. Aggregated and statistical data are stored in a MongoDB AGG datastore. Comma Separated Value (.csv) files are created 326, which are loaded 328 into Oracle 330 for reporting/viewing through OBIEE 432. The latest message data received by the system appears in the aggregated statistics within a few milliseconds. Aggregated metrics are available the following hour, day, week, or month, depending on the granularity of the data.
Tables 4 and 5 below provide some of the metrics that would be of value to a payment processing platform, and some notes on those metrics, respectively.
Table 4 footnotes reference metrics such as total transactions submitted, authorizations, successful capture, and unsuccessful capture.
Referring again to
Clients of an ecommerce system may access the ELK stack 140 for real-time data. Real-time operational performance data provides key insights into the health of the system and allows the ecommerce provider to make adjustments as issues arise, and to associate user behavior with web site performance.
In addition to real-time operational performance data, the ecommerce system may collect information regarding cart creation and visit details from the API 104 requests made from the user to the ecommerce system. In addition to the bounce rate (statistics on the page at which a user leaves) and exit analysis of the RUM data 106, the API request provides data that gives clients insight into the cart funnel (the customer's path to conversion) to which they have not previously had access. By analyzing an entire visit, which has been captured in a document in the MongoDB 120, 128 database, the client can analyze what steps are causing customer confusion, what elements might be altering the customer's behavior during checkout or signup, and what technical nuisances arise during the experience; in other words, the entire customer experience can be analyzed.
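Cart-funnel analysis over visit documents of this kind can be sketched as follows. The funnel step names and the shape of the visit documents are illustrative assumptions, not the disclosure's schema.

```python
# Assumed ordering of steps on the customer's path to conversion.
FUNNEL_STEPS = ["view_product", "add_to_cart", "begin_checkout",
                "payment", "confirmation"]

def funnel_counts(visits):
    """Count how many visits reached each step of the cart funnel;
    comparing adjacent steps shows where customers drop off."""
    counts = {step: 0 for step in FUNNEL_STEPS}
    for visit in visits:
        for step in visit["steps"]:
            if step in counts:
                counts[step] += 1
    return counts

# Three hypothetical visit documents pulled from the message database.
visits = [
    {"visitor_id": "v1", "steps": ["view_product", "add_to_cart",
                                   "begin_checkout", "payment", "confirmation"]},
    {"visitor_id": "v2", "steps": ["view_product", "add_to_cart"]},
    {"visitor_id": "v3", "steps": ["view_product"]},
]
counts = funnel_counts(visits)
```

Here the largest drop-off occurs between add_to_cart and begin_checkout, which is exactly the kind of signal a client would investigate for confusing steps or technical nuisances.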
By viewing and analyzing this data in real time, clients are able to detect mounting technical problems and take quick action to minimize their impact. For example, a web store client monitoring page load data found load times quickly deteriorating. Recent changes to the page indicated that heavy graphics had been added to the web store catalog, and loading the page for a particular product was causing customers to abandon it before it had completed loading.
While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that this invention not be limited to the specific constructions and arrangements shown and described, since various other updates, combinations, omissions, modifications and substitutions, in addition to those set forth in the above paragraphs, are possible.
The steps and/or actions of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium may be coupled to the processor, such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. Further, in some embodiments, the processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). In the alternative, the processor and the storage medium may reside as discrete components in a computing device. Additionally, in some embodiments, the events and/or actions of a method or algorithm may reside as one or any combination or set of codes and/or instructions on a machine-readable medium and/or computer-readable medium, which may be incorporated into a computer program product.
In one or more embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored or transmitted as one or more instructions or code on a computer-readable medium. Non-transitory computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures, and that can be accessed by a computer.
Computer program code for carrying out operations of embodiments of the present invention may be written in an object oriented, scripted or unscripted programming language such as Java, Scala, Perl, Smalltalk, C++, or the like. However, the computer program code for carrying out operations of embodiments of the present invention may also be written in conventional procedural programming languages, such as the “C” programming language or similar programming languages.
Those skilled in the art may appreciate that various adaptations and modifications of the just described embodiments can be configured without departing from the scope and spirit of the invention. Therefore, it is to be understood that, within the scope of the appended claims, the invention may be practiced other than as specifically described herein.
This application claims the benefit of U.S. Provisional Application No. 62/511,366 filed 25 May 2017, entitled “Real Time Web Analytics System,” which is incorporated herein by reference.
Number | Date | Country
---|---|---
62/511,366 | May 2017 | US