This disclosure relates generally to data processing, and in particular to simplifying large-scale data processing.
Large-scale data processing involves extracting data of interest from raw data in one or more data sets and processing it into a useful product. Data sets are frequently large, gigabytes to terabytes in size, and may be stored on hundreds or thousands of server machines. While there have been developments in distributed file systems capable of supporting large data sets (such as the Hadoop Distributed File System (HDFS) and S3), there is still no efficient and reliable way to index and process gigabytes and terabytes of data for ad-hoc querying, turn the data into a useful product, or extract valuable information from it. An efficient way of indexing and processing large-scale data is therefore desired.
In one aspect, the inventive concept pertains to a computer-implemented method of processing data by creating an inverted column index. The method entails categorizing words in a collection of source files according to data type, generating a posting list for each of the words that are categorized, and organizing the words in an inverted column index format. In an inverted column index, each column represents a data type, each of the words is encoded in a key, and the posting list is encoded in a value associated with the key. In some cases, the words that are categorized may be the most commonly appearing words, arranged in the order of frequency of appearance in each column. This indexing method provides an overview of the words that are in a large dataset, allowing a user to choose a word of interest and “drill down” into content that includes that word by way of queries.
In another aspect, the inventive concept pertains to a non-transitory computer-readable medium storing instructions that, when executed, cause a computer to perform a method for processing data using an inverted column index. The method entails accessing source files from a database and creating the inverted column index with words that appear in the source files. The inverted column index is prepared by categorizing words according to data type, associating a posting list for each of the words that are categorized, and organizing the words in an inverted column index format, with each column representing a data type, wherein each of the words is included in a key and the posting list is included in a value associated with the key.
In yet another aspect, the inventive concept pertains to a computer-implemented method of processing data by creating an inverted column index. The method entails categorizing words in a collection of source files according to data type; generating a posting list for each of the words that are categorized; encoding a key with a word of the categorized words, its data type, its column ordinal, an identifier for the source file from which the word came, the word's row position in the source file, and a facet status, to create the inverted column index; and encoding a value with the key by which the value is indexed and the posting list that is associated with the key. The method further entails selecting rows of the source files and faceting the selected rows by storing them in a facet list; indicating, by using the facet status of a key, whether the row in the key is faceted; in response to a query including a word and a column ordinal, using the keys in the inverted column index to identify source files that contain the word in the column of the query and that are faceted; and accessing the facet list to parse the faceted rows in an inverted column index format, allowing preparation of a summary distribution or summary analysis that shows the most frequently appearing words in the source files that match the query.
In one aspect, the inventive concept includes presenting a summary distribution of content in a large data storage to a user upon the user's first accessing the data storage, before any query is entered. The summary distribution would show the frequency of appearance of the words in the stored files, providing a general statistical distribution of the type of information that is stored.
In another aspect, the inventive concept includes organizing data in a file into rows and columns and faceting the rows at a predefined sampling rate to generate the summary distribution.
In yet another aspect, the inventive concept includes presenting the data in the storage as a plurality of columns, wherein each of the columns represents a key or a type of data and the data cells are populated with terms, for example in order of frequency of appearance. Posting lists are associated with each term to indicate the specific places in the storage where the term appears, for example by document identifier, row, and column ordinal.
In yet another aspect, the inventive concept includes executing a query by identifying a term for a specified ColumnKey. Boolean queries may be executed by identifying respective terms for a plurality of ColumnKeys and specifying an operation, such as an intersection or a union.
In yet another aspect, the inventive concept includes caching results of some operations at a client computer and reusing the cached results to perform additional operations.
The disclosure pertains to a method and system for building a search index. A known data processing technique, such as MapReduce, may be used to implement the method and system. MapReduce typically involves restricted sets of application-independent operators, such as a Map operator and a Reduce operator. Generally, the Map operator specifies how input data is to be processed to produce intermediate data, and the Reduce operator specifies how the intermediate data values are to be merged or combined.
The disclosed embodiments entail building an index having a columnar inverted indexing structure that includes posting lists arranged in columns. The inverted indexing structure allows posting lists to be efficiently retrieved and transferred to local disk storage on a client computer on demand and as needed, by a runtime execution engine. Query operations such as intersections and unions can then be efficiently performed using relatively high performance reads from the local disk. The indexing structure disclosed herein is scalable to billions of rows.
The columnar inverted index structure disclosed herein strives to balance performance/scalability with simplicity. One of the contributors to the complexity of search toolkits (e.g., Lucene/Solr) is their emphasis on returning query results with subsecond latency. The columnar inverted indexing method described herein allows the latency constraint to be relaxed to provide search times on the order of a few seconds, and to make it as operationally simple as possible to build, maintain, and use with very large search indexes (Big Data).
The columnar inverted index also provides more than simple “pointers” to results. For example, the columnar inverted index can produce summary distributions over large result sets, thereby characterizing the “haystack in the haystack” in response to user request in real time and in different formats. The columnar inverted index represents a departure from a traditional approach to search and is a new approach aimed at meeting the needs of engineers, scientists, researchers, and analysts.
Unlike a conventional posting list, the posting lists described herein are columnar so that for each extant combination of term and column (e.g., “hello, column 3”), a posting list exists. The columnar posting lists allow Boolean searches to be conducted using columns and not rows, as will be described in more detail below.
Table 1 below shows the information that a ColumnKey encodes during the ColumnKey encoding process 62. The information includes type, term, column, URI, position, and facet status.
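As a rough illustration of the fields listed in Table 1, a ColumnKey might be sketched as the following Java class; the field names and types are assumptions for illustration, not the actual implementation.

```java
// Hypothetical sketch of a ColumnKey holding the fields listed in Table 1.
// Field names and Java types are illustrative assumptions.
public class ColumnKey {
    public enum Type { POSTING, FACET }

    Type type;        // whether this key refers to a posting list or a facet list
    String term;      // the indexed word
    int column;       // column ordinal in which the term occurred
    String uri;       // identifier of the source file (document URI)
    long position;    // row position of the occurrence within the source file
    boolean faceted;  // facet status: whether the row was sampled into a Facet List
}
```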
As mentioned above, MapReduce may be used to build the search index. The ColumnKey object includes a key partitioning function that causes column keys emitted from the mapper to arrive at the same reducer. For the purpose of generating posting lists, the mapper emits a blank value. The ColumnKey key encodes the requisite information. ColumnKeys having the same value for the fields type, term, and column will arrive at the same reducer. The order in which they arrive is controlled by the following ColumnKey Comparator:
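A comparator consistent with this behavior, assuming the ColumnKey fields sketched above, might look like the following sketch.

```java
import java.util.Comparator;

// Illustrative ColumnKey Comparator: nest by type, then term, then column,
// then document URI and row position, so that all postings belonging to one
// (type, term, column) combination arrive at the same reducer in sorted order.
public class ColumnKeyComparator implements Comparator<ColumnKey> {
    @Override
    public int compare(ColumnKey a, ColumnKey b) {
        int c = a.type.compareTo(b.type);
        if (c != 0) return c;
        c = a.term.compareTo(b.term);
        if (c != 0) return c;
        c = Integer.compare(a.column, b.column);
        if (c != 0) return c;
        c = a.uri.compareTo(b.uri);
        if (c != 0) return c;
        return Long.compare(a.position, b.position);
    }
}
```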
Therefore, the keys are ordered in the following nesting order: first by type, then by term, then by column ordinal, and finally by document URI and row position within the source file.
The keys control the sorting of the posting lists. As such, a reducer initializes a new posting list each time it detects a change in any of the type, term, or column ordinal fields of the keys that it receives. Subsequently received keys having the same (type, term, column ordinal) tuple as the presently-initialized posting list may be added directly to the posting list.
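As a sketch of that reducer behavior, assuming keys arrive in the order imposed by the comparator above (PostingList, emit, and sortedKeys are hypothetical placeholders):

```java
// Start a new posting list whenever the (type, term, column) portion of the
// incoming key changes; otherwise append to the current list.
ColumnKey current = null;
PostingList postings = null;                  // hypothetical posting-list builder

for (ColumnKey key : sortedKeys) {            // keys as delivered to the reducer
    boolean sameList = current != null
            && key.type == current.type
            && key.term.equals(current.term)
            && key.column == current.column;
    if (!sameList) {
        if (postings != null) {
            emit(current, postings);          // hypothetical emit of the finished list
        }
        postings = new PostingList();
        current = key;
    }
    postings.add(key.uri, key.position, key.faceted);
}
if (postings != null) {
    emit(current, postings);
}
```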
A problem in Reducer application code is providing the ability to “rewind” through a reducer's iterator to perform multi-pass processing, a capability the Hadoop Reducer does not offer. To overcome this problem, the indexing process 60 may emit payload content into a custom rewindable buffer. The buffer implements a two-level buffering strategy: it first buffers in memory up to a given size, and then transfers the buffer into an Operating System-allocated temporary file when the buffer exceeds a configurable threshold.
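A minimal sketch of such a two-level rewindable buffer, assuming a byte-oriented interface (class and method names are illustrative):

```java
import java.io.*;

// Sketch of a two-level rewindable buffer: content is kept in memory until a
// configurable threshold is exceeded, then spilled to an OS temporary file.
public class RewindableBuffer {
    private final int threshold;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private File spillFile;
    private OutputStream out;

    public RewindableBuffer(int threshold) {
        this.threshold = threshold;
        this.out = memory;
    }

    public void write(byte[] data) throws IOException {
        if (spillFile == null && memory.size() + data.length > threshold) {
            // first level exceeded: move the in-memory content to a temp file
            spillFile = File.createTempFile("rewind", ".buf");
            out = new BufferedOutputStream(new FileOutputStream(spillFile));
            memory.writeTo(out);
            memory = null;
        }
        out.write(data);
    }

    // "Rewind": obtain a fresh stream over everything written so far.
    public InputStream rewind() throws IOException {
        out.flush();
        return spillFile == null
                ? new ByteArrayInputStream(memory.toByteArray())
                : new BufferedInputStream(new FileInputStream(spillFile));
    }
}
```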
The posting list generation process 64 includes a posting list abstraction process 66 and a posting list encoding process 68. During the abstraction process 66, posting lists are abstracted as packed binary number lists. The document URI, the row position, and the faceted field are encoded into a single integer with a predetermined number of bits. For example, a single 64-bit integer may break down as follows: the lower 40 bits (bits 0-39) encode the row position (the row's physical address within the source file), bits 40-61 encode the document identifier, and the remaining high bits (62 and 63) carry the faceted flag.
Bits 62 and 63 may be zeroed out with a simple bitmask, allowing the process to treat the integer as a 62-bit unsigned number whose value increases monotonically. In this particular embodiment, where the lower 40 bits encode the row's physical file address, files up to 2^40 bytes (1 terabyte) can be indexed. The document identifier (the URI) may be obtained by placing the source file URIs in a lexicographically ordered array and using the array index of a particular document URI as the document identifier. Bits 40-61 (22 bits) encode the document identifier, so up to 2^22, or a little more than 4 million, documents can be included in a single index. The number of bits used for the row position and the document identifier can be changed as desired, for example so that more documents can be included in a single index at the cost of reducing the maximum indexable length of each document.
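The following sketch packs and unpacks a posting according to this layout; placing the faceted flag in bit 62 is an assumption for illustration.

```java
// Illustrative bit packing for a single posting: bits 0-39 row position,
// bits 40-61 document identifier, bit 62 faceted flag (assumed placement).
public final class PostingCodec {
    private static final long ROW_MASK = (1L << 40) - 1;   // lower 40 bits
    private static final long DOC_MASK = (1L << 22) - 1;   // 22 bits for the doc id
    private static final long FACET_BIT = 1L << 62;

    public static long pack(long rowPosition, int docId, boolean faceted) {
        long packed = (rowPosition & ROW_MASK) | (((long) docId & DOC_MASK) << 40);
        return faceted ? packed | FACET_BIT : packed;
    }

    public static long rowPosition(long posting) { return posting & ROW_MASK; }

    public static int docId(long posting) { return (int) ((posting >>> 40) & DOC_MASK); }

    public static boolean isFaceted(long posting) { return (posting & FACET_BIT) != 0; }

    // Mask off the top two bits so postings compare as 62-bit unsigned values.
    public static long forOrdering(long posting) { return posting & ~(3L << 62); }
}
```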
During the posting list encoding process 68, successively-packed binary postings are delta-encoded, whereby the deltas are encoded as variable length integers. The following code segment illustrates how the postings may be decoded:
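One possible sketch of that decoding, assuming a varint format of seven payload bits per byte with the high bit marking continuation:

```java
import java.io.DataInput;
import java.io.IOException;

// Illustrative decoding of a delta-encoded posting list. Each delta is stored
// as a variable-length integer; adding it to the running value recovers the
// original packed posting.
public class PostingDecoder {
    private final DataInput in;
    private long current = 0;

    public PostingDecoder(DataInput in) {
        this.in = in;
    }

    // Decode the next posting by applying the next delta to the running value.
    public long next() throws IOException {
        current += readVarLong(in);
        return current;
    }

    // Assumed varint encoding: 7 payload bits per byte, high bit = "more bytes follow".
    private static long readVarLong(DataInput in) throws IOException {
        long value = 0;
        int shift = 0;
        byte b;
        do {
            b = in.readByte();
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```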
An object named ColumnFragment encodes posting lists. The encoding is done such that a posting list may be fragmented into separate pieces, each of which can be downloaded by a client in parallel. Table 2 depicts an exemplary format of a ColumnFragment, having the following four fields: ColumnKey, sequence number, length, and payload. As shown, the payload is stored as an opaque sequence of packed binary longs, each encoding a posting. As mentioned above, the posting list indicates all the places where the ColumnKey term appears. The posting list object does not store each posting as an object or primitive subject to a Hadoop serialization/deserialization event (i.e., the “DataInput, DataOutput” read and write methods), as this incurs the overhead of a read or write call for each posting. Packing the postings into a single opaque byte array allows Hadoop serialization of the postings to be achieved with a single read or write call that handles the entire byte array en masse. A Sequence File is output by the Reducer; its keys are of type ColumnKey, and its values are of type ColumnFragment.
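A sketch of a ColumnFragment along these lines, with the Hadoop Writable plumbing shown for illustration only, might look as follows.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Illustrative ColumnFragment: the payload is written and read as one opaque
// byte array, so Hadoop serialization costs a single read or write call
// rather than one call per posting.
public class ColumnFragment implements Writable {
    ColumnKey key;        // also carried as the SequenceFile key; not serialized here
    int sequenceNumber;   // position of this fragment within the full posting list
    int length;           // number of payload bytes
    byte[] payload;       // packed binary longs, one per posting

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(sequenceNumber);
        out.writeInt(payload.length);
        out.write(payload);              // single call for the entire payload
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sequenceNumber = in.readInt();
        length = in.readInt();
        payload = new byte[length];
        in.readFully(payload);           // single call to read the payload en masse
    }
}
```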
When a particular term-occurrence (posting) is “faceted,” the entire row of the source data file in which the posting occurred has been sampled and indexed into the Facet List corresponding to the posting. When a posting list is processed in the indexing process 60 and a posting has the faceted bit set in its packed binary representation, the runtime engine 10 is instructed to retrieve the entire row from the Facet List and pass it to the FacetCounter.
A single key in the Sequence File is itself a ColumnKey object, describing a term and column, and the corresponding value in the Sequence File is either a posting list or a facet list, depending on the type field of the ColumnKey. A Sequence File consists of many such key-value pairs in sequence. The Sequence File may be indexed using the Hadoop Map File paradigm; a Map File is an indexed Sequence File (a sequence file with an additional file called the index file). In some cases, the default behavior of a Map File may be to index only one of every 100 entries. In those cases, an index entry would exist for 1 of every 100 ColumnKeys, forcing linear scans from an indexed key to the desired key, on average 50 key-value pairs (50 being the average distance from the nearest indexed key). Because posting lists can be large binary objects, direct single seeks are more desirable than long scans through the posting lists. Therefore, an index entry is generated for each ColumnKey/ColumnFragment pair in the Sequence File, and linear scans through vast amounts of data are avoided. The files generated as part of MapReduce reside in a Hadoop-compatible file system, such as HDFS or S3.
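A minimal sketch of forcing an index entry for every key when writing the Map File; the io.map.index.interval property is standard Hadoop configuration, and the rest is illustrative.

```java
import org.apache.hadoop.conf.Configuration;

// Request an index entry for every key/value pair written to the Map File,
// so any ColumnKey can be located with a direct seek instead of a linear scan.
Configuration conf = new Configuration();
conf.setInt("io.map.index.interval", 1);
// A MapFile.Writer created with this configuration (or configured through its
// setIndexInterval method) indexes every ColumnKey/ColumnFragment pair.
```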
The search index and the summary distribution reside in the distributed file system 20. In one embodiment of the inventive concept, the summary distribution is presented to a user when a user first accesses a distributed file system, as a starting point for whatever the user is going to do. The summary distribution provides a statistical overview of the content that is stored in the distributed file system, providing the user some idea of what type of information is in the terabytes of stored data.
Using the summary distribution as a starting point, the user may “drill down” into whichever field is of interest to him. For example, in the summary distribution of
To support summary analysis on queries, a posting list may have a corresponding Facet List. A “facet,” as used herein, is a counted unique term, such as “USA” as shown in
The indexing technique disclosed herein maintains a local disk-based BTree for the purpose of resolving the location of a columnar posting list in the distributed file system or in the local disk cache. The runtime engine 10, as part of its initialization process, reads the Map File's index file out of the distributed file system and stores it in an on-disk BTree implementing the Java NavigableSet<ColumnKey> interface. The ColumnKey object includes the following fields, which are generally not used during MapReduce, but which are populated and used by the runtime engine 10:
The ColumnKey objects are stored in a local disk-based BTree, making prefix scanning practical and as simple as using the NavigableSet's headSet and tailSet methods to obtain an iterator that scans either forward or backward in the natural ordering, beginning with a given key. For example, to find all index terms beginning with “a,” the tailSet for a ColumnKey with type=POSTING and term=“a” can be iterated over. Notice that not only are all terms that begin with “a” accessible, but all columns in which those terms occur are accessible and differentiable, because the column is one of the fields included in the ColumnKey's Comparator (see above). Term scanning can also be applied to terms that describe a hierarchical structure, such as an object “dot” notation, for instance “address.street.name.” Index scanning can be used to find all the fields of the address object simply by obtaining the tailSet of “address.” For objects contained in particular columns (such as JSON embedded in a column of a CSV file), “dot” notation can be combined with column information, enabling the index to be scanned for a particular object field path and the desired column. Index terms can also be fuzzy matched, for example by storing a Hilbert number in the term field of the ColumnKey as described in U.S. patent application Ser. No. 14/030,863.
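As a sketch of the prefix scan, assuming the ColumnKey fields and comparator sketched earlier and a hypothetical accessor for the on-disk BTree:

```java
import java.util.NavigableSet;

// Iterate the tailSet starting at (type = POSTING, term = "a") and stop once
// the terms no longer begin with the prefix. The column field of each key
// identifies which column the term occurred in.
NavigableSet<ColumnKey> keys = openColumnKeyBTree();   // hypothetical accessor

ColumnKey from = new ColumnKey();
from.type = ColumnKey.Type.POSTING;
from.term = "a";
from.uri = "";           // lowest possible values for the remaining fields
from.position = 0L;

for (ColumnKey key : keys.tailSet(from, true)) {
    if (key.type != ColumnKey.Type.POSTING || !key.term.startsWith("a")) {
        break;           // past the prefix in the natural ordering
    }
    System.out.println(key.term + " in column " + key.column);
}
```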
The drilling down into the summary distribution may be achieved through a Boolean query. For example, instead of clicking on the word “iOS” under the operating system column as described above, a user may type in a Boolean expression such as “column 5=iOS.” The runtime engine 10 parses queries and builds an Abstract Syntax Tree (AST) representation of the query (validating that the query conforms to a valid expression in the process). The Boolean OR operator (|) is recognized as a union, and the Boolean AND operator (&&) is recognized as an intersection operation. A recursive routine is used to execute a pre-order traversal of the AST. This is best explained by direct examination of the source subroutine. The parameters are as follows:
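A hypothetical signature for such a routine, with assumed parameter names and types, is sketched below for orientation.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical signature for the recursive AST navigation routine. The
// parameter names and types are assumptions for illustration only.
interface AstNavigator {
    /**
     * @param node         current AST node: a leaf holds a term/column predicate,
     *                     an interior node holds a Boolean operator (&& or |)
     * @param facetCounter collects facet counts as faceted postings are visited
     * @param cacheDir     local directory where each subtree's result File is cached
     * @return a File containing the packed postings that satisfy the subtree
     */
    File navigate(AstNode node, FacetCounter facetCounter, File cacheDir) throws IOException;
}
```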
The result (return type) of the Boolean query is a File array. Every part of the syntax tree in a Boolean query is cached separately as a file. Therefore, no in-memory data structure, such as a List or byte array, consumes memory. Although files are slower to read and write than in-memory data structures, the use of files has several advantages over memory:
The AST Navigation may be executed as follows:
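A sketch of such a pre-order traversal, consistent with the caching and operator handling described above (the helper methods are hypothetical):

```java
// Resolve leaves to posting-list Files, combine child results with an
// intersection (&&) or union (|), and cache every subtree result as a File.
File navigate(AstNode node, FacetCounter facetCounter, File cacheDir) throws IOException {
    File cached = cachedResultFor(node, cacheDir);       // hypothetical cache lookup
    if (cached != null) {
        return cached;                                   // every subtree is cached separately
    }
    File result;
    if (node.isLeaf()) {
        // leaf node: fetch the posting list for (term, column) from the index
        result = fetchPostingList(node.term(), node.column());
    } else {
        File left = navigate(node.left(), facetCounter, cacheDir);
        File right = navigate(node.right(), facetCounter, cacheDir);
        result = node.isAnd()
                ? intersectionOf(left, right, cacheDir)              // &&
                : unionOf(left, right, facetCounter, cacheDir);      // |
    }
    return cacheResult(node, result, cacheDir);          // hypothetical cache store
}
```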
A PostingDecoder object decodes the posting lists. Two posting lists may be intersected according to the following logic. Note that it is up to the caller of the nextIntersection method to perform faceting if so desired. The Intersection process is carried out as follows:
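One possible sketch of such a nextIntersection routine over two PostingDecoders, assuming end-of-list is signaled by EOFException and reusing the packing helpers sketched earlier:

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

// Return the next posting present in both lists, or -1 if either list is
// exhausted. Postings are compared on their 62-bit ordering value, so the
// facet bit is ignored for matching but preserved in the returned posting.
long nextIntersection(PostingDecoder a, PostingDecoder b) {
    try {
        long pa = a.next();
        long pb = b.next();
        while (true) {
            long ka = PostingCodec.forOrdering(pa);
            long kb = PostingCodec.forOrdering(pb);
            if (ka == kb) {
                return pa;            // the caller may facet this posting if desired
            }
            if (ka < kb) {
                pa = a.next();        // advance the list that is behind
            } else {
                pb = b.next();
            }
        }
    } catch (EOFException end) {
        return -1;                    // one of the posting lists ran out
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}
```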
The nextIntersection method is invoked as follows:
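For example, a caller might drain the intersections and perform the faceting itself, along these lines (facetCounter, fetchFacetRow, and results are hypothetical):

```java
// Drain all intersections of two posting lists; facet any match whose
// faceted bit is set, as faceting is left to the caller.
long match;
while ((match = nextIntersection(left, right)) != -1) {
    if (PostingCodec.isFaceted(match)) {
        facetCounter.count(fetchFacetRow(match));   // hypothetical Facet List lookup
    }
    results.add(match);
}
```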
The Union operation's logic finds all elements of the union, stopping at the first intersection. Consequently, the caller passes in the FacetCounter so that the potentially numerous elements of the union may be faceted without returning to the calling code. The Union process is executed as follows:
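A sketch of a collectUnions routine that facets union-only elements as it goes and stops at the first common posting follows; the helper names are assumptions.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

// Emit and facet postings that appear in either list until the first posting
// common to both lists is reached; return that posting, or -1 if none remains.
long collectUnions(PostingDecoder a, PostingDecoder b, FacetCounter facetCounter) {
    try {
        long pa = a.next();
        long pb = b.next();
        while (true) {
            long ka = PostingCodec.forOrdering(pa);
            long kb = PostingCodec.forOrdering(pb);
            if (ka == kb) {
                return pa;                                     // first intersection: stop here
            }
            long smaller = (ka < kb) ? pa : pb;                // union-only element
            if (PostingCodec.isFaceted(smaller)) {
                facetCounter.count(fetchFacetRow(smaller));    // facet without returning
            }
            emitUnionElement(smaller);                         // hypothetical result sink
            if (ka < kb) {
                pa = a.next();
            } else {
                pb = b.next();
            }
        }
    } catch (EOFException end) {
        return -1;
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}
```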
The CollectUnions process is invoked as follows:
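Invocation might then look like the following, with each returned value being an intersection element that also belongs to the union:

```java
// Repeatedly collect union elements; each non-negative return value is a
// posting shared by both lists and is itself part of the union.
long shared;
while ((shared = collectUnions(left, right, facetCounter)) != -1) {
    if (PostingCodec.isFaceted(shared)) {
        facetCounter.count(fetchFacetRow(shared));
    }
    emitUnionElement(shared);
}
```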
Various embodiments of the present invention may be implemented in or involve one or more computer systems. The computer system is not intended to suggest any limitation as to the scope of use or functionality of the described embodiments. The computer system includes at least one processing unit and memory. The processing unit executes computer-executable instructions and may be a real or a virtual processor. The computer system may be a multi-processing system that includes multiple processing units for executing computer-executable instructions to increase processing power. The memory may be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory, etc.), or a combination thereof. In an embodiment of the present invention, the memory may store software for implementing various embodiments of the present invention.
Further, the computer system may include components such as storage, one or more input computing devices, one or more output computing devices, and one or more communication connections. The storage may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, compact disc-read only memories (CD-ROMs), compact disc rewritables (CD-RWs), digital video discs (DVDs), or any other medium which may be used to store information and which may be accessed within the computer system. In various embodiments of the present invention, the storage may store instructions for the software implementing various embodiments of the present invention. The input computing device(s) may be a touch input computing device such as a keyboard, mouse, pen, trackball, touch screen, or game controller, a voice input computing device, a scanning computing device, a digital camera, or another computing device that provides input to the computer system. The output computing device(s) may be a display, printer, speaker, or another computing device that provides output from the computer system. The communication connection(s) enable communication over a communication medium to another computer system. The communication medium conveys information such as computer-executable instructions, audio or video information, or other data in a modulated data signal. A modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired or wireless techniques implemented with an electrical, optical, RF, infrared, acoustic, or other carrier. In addition, an interconnection mechanism such as a bus, controller, or network may interconnect the various components of the computer system. In various embodiments of the present invention, operating system software may provide an operating environment for software executing in the computer system, and may coordinate activities of the components of the computer system.
Various embodiments of the present invention may be described in the general context of computer-readable media. Computer-readable media are any available media that may be accessed within a computer system. By way of example, and not limitation, within the computer system, computer-readable media include memory, storage, communication media, and combinations thereof.
Having described and illustrated the principles of the invention with reference to described embodiments, it will be recognized that the described embodiments may be modified in arrangement and detail without departing from such principles. It should be understood that the programs, processes, or methods described herein are not related or limited to any particular type of computing environment, unless indicated otherwise. Various types of general purpose or specialized computing environments may be used with or perform operations in accordance with the teachings described herein. Elements of the described embodiments shown in software may be implemented in hardware and vice versa.
While the exemplary embodiments of the present invention are described and illustrated herein, it will be appreciated that they are merely illustrative.
This application claims the benefit of U.S. Provisional Application No. 61/758,691 that was filed on Jan. 30, 2013, the content of which is incorporated by reference herein.