1. Field
The present disclosure relates to graph analytics. More specifically, this disclosure relates to a method and system for translating transaction data to graph representation for input to a graph analytics application, thereby facilitating efficient collaborative filtering applications.
2. Related Art
Hipergraph is a high-performance graph analytics engine that performs very fast queries on graph data. Graph data is data that can be easily represented by a graph: a set of vertices connected by edges. Hipergraph requires its input to be in a very specific format, but formatting many real-world graph datasets is non-trivial because the formatting operations exceed the typical memory and disk capacity of a single machine.
In one approach, one can perform automated compilation and formatting of data using scripts and UNIX utilities. This approach works relatively well when the input files, output files, and intermediate computations fit on a modern workstation. However, when the input graph dataset is on the order of several hundred gigabytes, one cannot even sort the data on a standard machine because of the time, disk space, and memory required.
One embodiment of the present invention provides a system for generating a product recommendation by translating transaction data to graph representation for input to a graph analytics application. During operation, the system generates a transaction table to store transaction data, a customer table to store customer data, and a product table to store product data. The system then generates, with data from the transaction table, a table containing topology and edge identifier information and a table containing edge attribute information. Next, the system generates one or more headers that include data describing the customer table and/or the product table and/or the table containing edge attribute information. Subsequently, the system generates one or more files containing the one or more headers and data from the tables, in which the data describes a graph with edges representing transactions and vertices representing customers or products. The system then submits the one or more files as input to the graph analytics application to generate a product recommendation.
In a variation on this embodiment, generating a transaction table, a customer table, and a product table includes retrieving data from a table storing data according to a relational model.
In a variation on this embodiment, the transactions are purchase transactions.
In a variation on this embodiment, generating a customer table and/or product table and/or transaction table includes assigning unique consecutive integer values to each row of the customer table and/or product table and/or transaction table.
In a variation on this embodiment, the system sorts the table containing topology and edge identifier information first by the customer ID, then the edge type, then the product ID, and then the transaction ID.
In a variation on this embodiment, every step of the method is executed by a single script.
In the figures, like reference numerals refer to the same figure elements.
The following description is presented to enable any person skilled in the art to make and use the embodiments, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
Embodiments of the present invention solve the problem of translating transaction data to a format suitable for input to a graph analytics application by automatically compiling the data and translating the data to a graph model representation. A graph translation system may extract data from a data source, such as a relational table storing various types of data. The system processes and projects the data into intermediate tables, generates headers, and then writes the tables, headers, and additional tables with data describing a graph topology and associated edge attribute data into a suitably formatted text file. The generated graph reflects transaction relationships found in the extracted data. The graph translation system may then submit the generated graph data as input to a graph analytics application (e.g., Hipergraph) for analysis. One example application is analyzing purchase transaction data with Hipergraph to perform collaborative filtering, in order to suggest product recommendations to users.
The Hipergraph engine requires its input to be a text file formatted in a very specific way, with certain constraints and formatting requirements. The input text file should include a description of a graph and other associated data, including the graph topology and data associated with vertices. The graph structure data should be separated from data describing attributes of the vertices and edges. For collaborative filtering, vertices represent customers or products, while edges represent purchase transactions.
In one implementation, the system utilizes data warehousing software such as Hive built on top of Hadoop. Hive is a data warehouse system for Hadoop. Hive facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets residing in distributed storage with Hadoop-compatible file systems, and provides a SQL interface to access the stored data. Hadoop serves as the storage layer: a framework for the distributed processing of large data sets across clusters of machines that pools the machines' storage capacity and CPU power. Hadoop provides for storage of data across the clusters, and allows for scaling from a single server to thousands of machines. It provides a scalable distributed file system that spans all the nodes in the clusters. Retailers may store hundreds of gigabytes of data in a Hadoop cluster. One may use Hive with Hadoop and Bash scripting in order to automatically compile data from typical relational database tables into a format appropriate for Hipergraph. Bash is a Unix shell that can read commands from a script file and execute them.
The input to Hipergraph is a set of tables and headers generated with Hive. In some implementations, one can execute SQL scripts in Hive to project customer, product, and transaction data from a main table to smaller intermediate tables. For example, in a retail database, there may be a main table where one column stores customer IDs, and another column stores the product ID for a purchase that the customer made. Other columns in this main table may store a transaction ID and the date of such a transaction. The intermediate tables generated by Hive may be written to storage in the form of text files, where each new-line delimited line in a text file represents a row in the table. The system may also generate headers and a table describing the topology of the graph, and another table describing attributes of edges.
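The projection of a main table into smaller intermediate tables can be sketched in standard SQL. The following is a minimal illustration only: SQLite stands in for Hive, and the table and column names (main_table, cust_id, prod_id, and so on) are hypothetical placeholders for whatever schema a given retailer uses.

```python
import sqlite3

# Hypothetical main table; in practice this would be a Hive table over HDFS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE main_table (txn_id INTEGER, cust_id TEXT, prod_id TEXT, txn_date TEXT)"
)
conn.executemany("INSERT INTO main_table VALUES (?,?,?,?)", [
    (1, "c1", "p1", "2014-01-01"),
    (2, "c1", "p2", "2014-01-02"),
    (3, "c2", "p1", "2014-01-03"),
])

# Project distinct customers and products into intermediate tables,
# analogous to the SELECT DISTINCT queries one would run in Hive.
conn.execute("CREATE TABLE customer AS SELECT DISTINCT cust_id FROM main_table")
conn.execute("CREATE TABLE product AS SELECT DISTINCT prod_id FROM main_table")
conn.execute(
    "CREATE TABLE purchase AS SELECT txn_id, cust_id, prod_id, txn_date FROM main_table"
)

print([r[0] for r in conn.execute("SELECT cust_id FROM customer ORDER BY cust_id")])
```

In a Hive deployment the same statements would run against distributed storage, and each resulting intermediate table would be written out as a new-line delimited text file.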
One can utilize an automated procedure for translating data stored according to the relational model to a graph representation that is suitably formatted for input to Hipergraph. In an example implementation, one can invoke a script with a single command to produce graph representation output that can serve as input into Hipergraph. The system generates a text file with data representing the graph structure as the input to Hipergraph. The system may sort the columns of these files with a specific ordering, as specified by Hipergraph. The system can then input the generated graph to Hipergraph, thereby facilitating efficient analysis of transaction data.
The system may utilize the techniques disclosed herein to provide recommendations to a customer. A retailer may sell products to customers, and in order to provide recommendations to customers, the system may perform collaborative filtering. But collaborative filtering requires an analysis of customer transaction data, including purchase data. The system may examine an item purchased by a customer, determine the purchases made by other customers who also purchased that item, and then recommend one of those other items to the first customer. The customer transaction data may be mixed with other data and stored in a relational database (or some other format unsuitable for direct input to Hipergraph). The system can apply the translation techniques discussed herein to translate the transaction data into a format suitable for input to Hipergraph. Hipergraph can traverse graphs very efficiently to perform collaborative filtering.
For collaborative filtering, the system may generate a bipartite graph with purchase transaction data extracted from a retailer's database. A bipartite graph is a graph with vertices that can be divided into two disjoint sets U and V such that every edge connects a vertex in set U to a vertex in set V. Vertices representing products make up one set of the bipartite graph and vertices representing customers make up the other set of the bipartite graph. The graph represents the retailer's database of transaction data. Hipergraph may then traverse the bipartite graph to generate a product recommendation.
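The two-hop traversal that a graph analytics engine would perform over such a bipartite graph can be sketched as follows. This is an illustrative in-memory approximation, not Hipergraph's actual implementation; the customer and product names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical purchase edges of the bipartite graph: (customer, product).
edges = [("alice", "bread"), ("alice", "milk"),
         ("bob", "bread"), ("bob", "butter"),
         ("carol", "milk"), ("carol", "jam")]

# Adjacency sets for the two disjoint vertex sets U (customers) and V (products).
by_customer = defaultdict(set)
by_product = defaultdict(set)
for c, p in edges:
    by_customer[c].add(p)
    by_product[p].add(c)

def recommend(customer):
    """Two-hop traversal: customer -> products -> co-purchasers -> their products."""
    scores = Counter()
    for p in by_customer[customer]:
        for other in by_product[p] - {customer}:
            for rec in by_customer[other] - by_customer[customer]:
                scores[rec] += 1
    # Rank by co-purchase count; break ties alphabetically for determinism.
    return sorted(scores, key=lambda prod: (-scores[prod], prod))

print(recommend("alice"))  # ['butter', 'jam']
```

Alice bought bread and milk; bread's co-purchaser bob also bought butter, and milk's co-purchaser carol also bought jam, so both are candidate recommendations.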
Note that without the translation process disclosed herein, one would need to perform database joins with data from a relational table, which is a time-consuming, inefficient process, especially when large relational tables are involved. Rather than performing database joins, the system traverses the bipartite graph with Hipergraph. For example, Hipergraph may traverse the bipartite graph to generate a product recommendation. Performing graph traversal in order to generate a product recommendation is much faster than performing a join with database tables.
Although examples are discussed herein with respect to Hipergraph, embodiments of the present invention may apply the disclosed techniques with any graph analytics application. Some embodiments may utilize a relational interface, such as HBase or SQL, instead of Hive. Some of the scripts may be written in C/C++ or in scripting languages, or may use standard UNIX utilities such as cat, join, paste, cut, and sort.
System 100 may include a set of servers 102A-102E executing as part of a cluster 104. Each of the servers 102A-102E may include software and data stored on a storage device. For example, server 102E may include a storage device 106. Storage device 106 may store data and software such as graph analytics application 108 (e.g., Hipergraph), transaction data 110, graph data 112, and graph translation software 114. Some implementations may also include a client 116 communicating with servers in cluster 104 to request graph translation services associated with Hipergraph.
Note that different implementations of the present invention may include any number of servers and storage devices. Various inventive aspects of system 100 are further described below.
As illustrated in
System 100 may generate intermediate tables that include a customer table 212, a product table 214, and a purchase table 216. Customer table 212 may store data describing customers. Product table 214 may store data describing products that are available for purchase. Purchase table 216 may store data describing customer purchases of products, although different implementations may also include other types of transactions in this or another transaction table. System 100 retrieves data from main table 210, and stores the retrieved data in customer table 212, product table 214, and purchase table 216. Main table 210 may store data according to, for example, the relational model. Different retailers may store main tables that, depending on the domain, vary in structure or type of data stored, and may include any number of tables.
In the example implementation, system 100 generates headers for the intermediate tables and writes the intermediate tables and headers to text file 206. System 100 may also include other tables containing edge-related information for submission to graph analytics application 108. One of the tables is illustrated in
Another field in header 306 may store the number of columns in edge attributes table 312 (e.g., 4 in the depicted example of
System 100 may sort the columns of table 322 to adhere to the requirements of graph analytics application 108. For example, system 100 may sort by the customer ID, then the edge type, then the product ID, and then the transaction ID. One may specify to Hive the sort order for table 322 when writing table 322 to a text file. System 100 may also generate a redundant topology file containing the same data as table 322 but sorted differently (e.g., sorting according to different columns). Graph analytics application 108 may require this redundant topology file to facilitate efficient graph analysis. Note that storing the topology of the graph separately from the graph metadata facilitates efficient processing of the graph by allowing the system to not load the metadata until needed.
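The multi-key sort described above, together with the redundant differently-sorted copy, can be sketched as follows. The row layout (customer ID, edge type, product ID, transaction ID) follows the sort order stated above; the concrete values are hypothetical.

```python
# Hypothetical topology rows: (customer_id, edge_type, product_id, txn_id).
rows = [
    (2, 0, 7, 104),
    (1, 0, 9, 102),
    (1, 0, 3, 101),
    (2, 1, 7, 103),
]

# Primary sort order: customer ID, then edge type, then product ID,
# then transaction ID.
rows.sort(key=lambda r: (r[0], r[1], r[2], r[3]))

# A redundant copy sorted differently (here product-major) may be written
# alongside, mirroring the redundant topology file described above.
by_product = sorted(rows, key=lambda r: (r[2], r[0]))
print(rows[0])
```

In the actual pipeline the sort would be specified to Hive when writing the table to a text file, rather than performed in application memory.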
During operation, system 100 may initially compile a Java program that assigns consecutive ID numbers to each row of a table to be generated (operation 402). For each type of vertex (or edge data), system 100 executes steps that process (e.g., generate and/or populate) each vertex table (or edge-related table) in turn (operation 404). System 100 may execute operations 406 to 416 to generate and populate each respective intermediate table.
System 100 selects the next unique ID from the Java program, and selects data associated with one or more vertices (or edges) (operation 406). For example, system 100 may extract the data for customers/products by executing SQL statements such as SELECT DISTINCT on a column storing a customer ID/product ID (or edge ID). System 100 retrieves the data that are relevant for customers (e.g., for customer table 212), for products (e.g., for product table 214), or for edges (e.g., for purchase table 216). With this operation, system 100 can retrieve all the row data for populating the customer, product, or purchase intermediate tables, and system 100 associates each row of the new table with a unique ID that increases in value with each added row. System 100 then validates the data (e.g., verifies that the fields are not null) (operation 408). System 100 may also normalize and re-factor data.
Next, system 100 inserts the resulting data as rows into a new table, such as customer table 212 or product table 214 (or purchase table 216) (operation 410). For example, for a customer John Smith, system 100 projects data including vertex ID and data associated with John Smith into customer table 308, along with data for all the other customers. In some implementations, when extracting transactions from main table 210, system 100 may separate transactions involving multiple products into multiple transactions, each transaction with a single product. In some implementations, some of these transactions may include refunds, purchases, or click-throughs. Hipergraph requires each row to be enumerated in sequence, so system 100 may assign consecutive ID numbers to each row, e.g., assign sequence numbers from 1 to the total number of customers/products/edges. In some implementations, the intermediate tables may store data in normalized form.
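The splitting of multi-product transactions and the assignment of consecutive row IDs can be sketched as follows. The record layout here is hypothetical; the one requirement taken from the description above is that row IDs run consecutively from 1.

```python
from itertools import count

# Hypothetical raw transactions; one transaction may cover several products.
raw = [
    {"txn": "t1", "customer": "c1", "products": ["p1", "p2"]},
    {"txn": "t2", "customer": "c2", "products": ["p3"]},
]

# Split each multi-product transaction into single-product rows, and assign
# consecutive ID numbers starting at 1, as the graph engine requires.
next_id = count(1)
rows = [
    {"row_id": next(next_id), "txn": t["txn"],
     "customer": t["customer"], "product": p}
    for t in raw
    for p in t["products"]
]
print([r["row_id"] for r in rows])  # [1, 2, 3]
```

Transaction t1, which covered two products, becomes two rows, each tied back to the same original transaction ID.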
System 100 may then compute the maximum ID assigned (e.g., using Hive), and may pass this maximum ID value to the next iteration starting from operation 404 (operation 412). System 100 then generates a header file for a respective table (operation 414). Note that generating the header file may include computing the number of rows in the respective table and determining whether the columns are strings or integers. For integer columns, system 100 determines whether the columns are 8-bit, 16-bit, or 32-bit integers by examining the minimum and maximum values.
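Header generation as described above amounts to counting rows and inferring a type for each column, choosing the narrowest integer width that covers the observed minimum and maximum. The sketch below assumes a simple signed-integer scheme and a dictionary-shaped header; the actual Hipergraph header format is not specified here.

```python
def int_width(values):
    """Pick the narrowest signed width (8/16/32-bit) that holds all values."""
    lo, hi = min(values), max(values)
    for bits in (8, 16, 32):
        if -(2 ** (bits - 1)) <= lo and hi < 2 ** (bits - 1):
            return bits
    raise ValueError("values exceed 32-bit range")

def make_header(table):
    """Hypothetical header: row count plus a type tag per column."""
    columns = list(zip(*table))
    types = []
    for col in columns:
        if all(isinstance(v, int) for v in col):
            types.append("int%d" % int_width(col))
        else:
            types.append("string")
    return {"rows": len(table), "types": types}

table = [(1, "c1", 40000), (2, "c2", 70000)]
print(make_header(table))  # {'rows': 2, 'types': ['int8', 'string', 'int32']}
```

The third column illustrates the min/max examination: 70000 overflows a 16-bit signed integer, so the column is tagged 32-bit.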
System 100 writes the formatted table onto the local file system, sorting by columns as appropriate (operation 416). In one implementation, system 100 may execute SQL queries to determine the size of strings and other fields in order to allocate disk space for the table. System 100 may generate text file 206 while writing the formatted table.
System 100 then projects the original table (e.g., main table 210) into a smaller table (e.g., purchase table 216) that contains only the edge attributes and the graph structure (operation 418). Subsequently, system 100 generates a topology table which contains only the edge ID and connection information between vertices, an example of which is depicted in
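The separation of topology from edge attributes described above can be sketched as two projections over the same purchase rows. The field names below are hypothetical; the point is that traversal needs only the topology tuples, so the attribute table can stay unloaded until it is needed.

```python
# Hypothetical purchase rows: edge ID, endpoint vertex IDs, and edge attributes.
purchases = [
    {"edge_id": 1, "customer": 10, "product": 20,
     "date": "2014-01-01", "amount": 9.99},
    {"edge_id": 2, "customer": 11, "product": 20,
     "date": "2014-01-02", "amount": 4.50},
]

# Topology table: only the edge ID and vertex connection information.
topology = [(r["edge_id"], r["customer"], r["product"]) for r in purchases]

# Edge attribute table: edge ID plus the remaining metadata, kept in a
# separate file so traversal need not load it.
attributes = [(r["edge_id"], r["date"], r["amount"]) for r in purchases]
print(topology[0])
```

The shared edge ID is what lets the engine join an edge back to its attributes on demand.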
In some embodiments, transaction data receiving module 502 can receive the transaction data, such as purchase data. Graph analytics application 108 (e.g., Hipergraph) receives the generated graph and analyzes the generated graph to facilitate generating purchase recommendations. Graph translation software 114 executes the translation process described herein to generate graph data 112 from transaction data 110.
The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. The computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing computer-readable data now known or later developed.
The methods and processes described in the detailed description section can be embodied as code and/or data, which can be stored in a computer-readable storage medium as described above. When a computer system reads and executes the code and/or data stored on the computer-readable storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the computer-readable storage medium.
Furthermore, methods and processes described herein can be included in hardware modules or apparatus. These modules or apparatus may include, but are not limited to, an application-specific integrated circuit (ASIC) chip, a field-programmable gate array (FPGA), a dedicated or shared processor that executes a particular software module or a piece of code at a particular time, and/or other programmable-logic devices now known or later developed. When the hardware modules or apparatus are activated, they perform the methods and processes included within them.
The foregoing descriptions of various embodiments have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention.