Modern data processing systems, such as general purpose computer systems, allow the users of such systems to create a variety of different types of data files. For example, a typical user of a data processing system may create text files with a word processing program such as Microsoft Word or may create an image file with an image processing program such as Adobe's PhotoShop. Numerous other types of files can be created, modified, edited, and otherwise used by one or more users of a typical data processing system. The large number of different types of files that can be created or modified can present a challenge to a typical user who is seeking to find a particular file which has been created.
Modern data processing systems often include a file management system which allows a user to place files in various directories or subdirectories (e.g. folders) and allows a user to give the file a name. Further, these file management systems often allow a user to find a file by searching not only the content of a file, but also by searching for the file's name, or the date of creation, or the date of modification, or the type of file. An example of such a file management system is the Finder program which operates on Macintosh computers from Apple Computer, Inc. of Cupertino, Calif. Another example of a file management system program is the Windows Explorer program which operates on the Windows operating system from Microsoft Corporation of Redmond, Wash. Both the Finder program and the Windows Explorer program include a find command which allows a user to search for files by various criteria including a file name or a date of creation or a date of modification or the type of file. This search capability searches through information which is the same for each file, regardless of the type of file. Thus, for example, the searchable data for a Microsoft Word file is the same as the searchable data for an Adobe PhotoShop file, and this data typically includes the file name, the type of file, the date of creation, the date of last modification, the size of the file and certain other parameters which may be maintained for the file by the file management system.
Certain presently existing application programs allow a user to maintain data about a particular file. This data about a particular file may be considered metadata because it is data about other data. This metadata for a particular file may include information about the author of a file, a summary of the document, and various other types of information. Some file management systems, such as the Finder program, allow users to find a file by searching through the metadata.
In a typical system, the various content, file, and metadata are indexed for later retrieval using a program such as the Finder program, in what is commonly referred to as an inverted index. For example, an inverted index might contain a list of references to documents in which a particular word appears. Given the large numbers of words and documents in which the words may appear, an inverted index can be extremely large. The size of an index presents many challenges in processing and storing the index, such as updating the index or using the index to perform a search.
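As a minimal illustration of the inverted index described above, the following sketch maps each word to the list of documents in which it appears. The function name `build_inverted_index` and the sample documents are assumptions made for illustration only and do not come from the described system.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it.

    `documents` is a dict of {doc_id: text}. Tokenization is a plain
    whitespace split, which is an illustrative simplification.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Freeze the sets into sorted lists so lookups return stable results.
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


# Hypothetical sample corpus.
docs = {1: "apple pie recipe", 2: "apple computer", 3: "pie chart"}
inverted = build_inverted_index(docs)
```

Even in this toy corpus every distinct word receives its own list of references, which suggests why an index over a large collection of documents can become extremely large.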
Methods and systems for processing an inverted index in a data processing system are described herein.
According to one aspect of the invention, a 2-level table is used for inverting an index. Since some terms occur far more commonly than others, a smaller table contains a subset of the more frequently occurring terms, and a larger table contains terms that occur rarely. An algorithm may be employed to determine whether terms occurring in the item should be indexed using the smaller table of frequently occurring terms or the larger table of terms that occur rarely. For example, the algorithm may include calculating the frequency with which a term occurs in the item or across all items. Because the smaller table will be updated more often than the larger table, the smaller table is generally optimized for updating, i.e., for making room in the table for inserts, whereas the larger table is generally not optimized for updating. The 2-level table may be used for an index of a single item or a corpus of items and decreases memory pressure and increases performance.
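One way the frequency calculation mentioned above might be sketched is shown below. It counts term occurrences across all items and places the most frequent terms in a small table and the remainder in a larger table. The function name `split_terms`, the fixed capacity, and the tie-breaking rule are assumptions for illustration; the description does not specify them.

```python
from collections import Counter


def split_terms(items, level1_capacity):
    """Split terms into a small table of frequent terms and a large table
    of rare terms, based on occurrence counts across all items.

    Ties are broken alphabetically so the result is deterministic; this
    tie-breaking rule is an illustrative assumption.
    """
    counts = Counter(term for text in items for term in text.lower().split())
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    level1 = dict(ranked[:level1_capacity])   # small, frequently updated
    level2 = dict(ranked[level1_capacity:])   # large, rarely updated
    return level1, level2


# Hypothetical sample items.
items = ["the cat sat", "the dog sat", "the bird flew"]
level1, level2 = split_terms(items, level1_capacity=2)
```

In this example the two most common terms land in the small table, so most updates touch only that table.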
According to one aspect of the invention, a postings file containing one or more postings lists is updated in reverse order. Each item in a postings list is updated with a pointer that points to the previous item that was entered for that term. As a result, old data in the postings file is referenced from new data, thus avoiding writing over old data and using a minimum memory footprint. When the space allocated for the postings file is exhausted, a new space is allocated and also updated in reverse order. Because each item in the postings list is updated with the pointer that points to the previous item that was entered for that term, the first entry for each postings list in the new space is updated with a pointer that points to the last entry that was entered for that term's postings list in the old space. As a result, during access, the postings file may be efficiently read in the forward direction, with the occasional large jump backwards in the file, accrued over many forward reads, instead of making many small backwards reads.
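The reverse-order update can be sketched as follows, under several stated assumptions: the postings file is modeled as an in-memory list of slots, each allocated space is a fixed-size segment appended after the previous one, and the class and method names (`ReversePostingsFile`, `add`, `offsets`) are hypothetical. It is an illustration of the idea rather than the described implementation.

```python
class ReversePostingsFile:
    """Postings stored in fixed-size segments, each filled from its last
    slot toward its first. Every posting holds a pointer to the previous
    posting entered for the same term."""

    def __init__(self, segment_size=4):
        self.segment_size = segment_size
        self.slots = []         # flat "file" of (item_id, prev_offset) or None
        self.segment_start = 0  # first offset of the current segment
        self.next_free = -1     # next slot to fill; -1 forces first allocation
        self.head = {}          # term -> offset of its most recent posting

    def _allocate_segment(self):
        # New space is appended after the old space and filled in reverse.
        self.segment_start = len(self.slots)
        self.slots.extend([None] * self.segment_size)
        self.next_free = len(self.slots) - 1

    def add(self, term, item_id):
        if self.next_free < self.segment_start:
            self._allocate_segment()
        offset = self.next_free
        # The new entry references the old one; old data is never rewritten.
        self.slots[offset] = (item_id, self.head.get(term))
        self.head[term] = offset
        self.next_free -= 1
        return offset

    def offsets(self, term):
        """Offsets visited when reading a term's list, newest posting first."""
        visited = []
        offset = self.head.get(term)
        while offset is not None:
            visited.append(offset)
            offset = self.slots[offset][1]
        return visited

    def postings(self, term):
        return [self.slots[offset][0] for offset in self.offsets(term)]


pf = ReversePostingsFile(segment_size=4)
pf.add("apple", 1)
pf.add("pie", 1)
pf.add("apple", 2)
pf.add("apple", 3)
pf.add("apple", 4)  # first segment is exhausted, so a second one is allocated
```

Reading the list for "apple" visits offsets 7, 0, 1, 3: one large jump backwards from the new space into the old space, followed by forward reads within the old space.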
According to one aspect of the invention, the postings entries in the postings file are stored in term order. In some cases, the most recent posting may be stored in a table, such as the smaller 2-level term table, allowing fast term frequency calculation. The most recent posting is updated to point to the next posting for that term in the postings file. Lastly, because updates are appended to the postings file, and because the file is written before updating the pointers into it, access to the file can be done without locks.
According to one aspect of the invention, an update set of an index is flushed so as to minimize memory use and maximize disk bandwidth. Flushing includes, among other actions, sorting the update set in string order and obtaining the page offset for each string. The update set entries may then be grouped by their page offsets, resorted into page offset major order and string sorted minor order, and inserted into the index in that order so that the store pages of the index are accessed in disk block order, thus minimizing memory use and maximizing disk bandwidth. In this manner, a single cursor may be used to point to the last accessed location in the page to decrease search time for string insertion.
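The ordering step of the flush might be sketched as below. The lookup `page_offset_of`, which returns the offset of the store page where a string belongs, is a hypothetical stand-in for whatever the index actually uses, and the sample offsets are invented for illustration.

```python
from itertools import groupby


def flush_order(update_set, page_offset_of):
    """Return (page_offset, term) pairs in page offset major order and
    string sorted minor order."""
    terms = sorted(update_set)                        # string order first
    keyed = [(page_offset_of(term), term) for term in terms]
    keyed.sort()                                      # offset major, string minor
    return keyed


def grouped_by_page(ordered):
    """Group the ordered pairs so each store page is visited exactly once."""
    return [
        (offset, [term for _, term in group])
        for offset, group in groupby(ordered, key=lambda pair: pair[0])
    ]


# Hypothetical mapping from term to the offset of its store page.
page_offsets = {"apple": 8192, "banana": 0, "cherry": 8192, "date": 0, "elder": 4096}
order = flush_order(page_offsets.keys(), page_offsets.__getitem__)
pages = grouped_by_page(order)
```

Because the terms within each page arrive in sorted order, an insertion cursor for that page only ever needs to move forward from its last position.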
The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate similar elements.
The embodiments of the present invention will be described with reference to numerous details set forth below, and the accompanying drawings will illustrate the described embodiments. As such, the following description and drawings are illustrative of embodiments of the present invention and are not to be construed as limiting the invention. Numerous specific details are described to provide a thorough understanding of the present invention. However, in certain instances, well known or conventional details are not described in order to not unnecessarily obscure the present invention in detail.
The present description includes material protected by copyrights, such as illustrations of graphical user interface images. The owners of the copyrights, including the assignee of the present invention, hereby reserve their rights, including copyright, in these materials. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office file or records, but otherwise reserves all copyrights whatsoever. Copyright Apple Computer, Inc. 2007.
Various different software architectures may be used to implement the functions and operations described herein, such as to perform the methods shown in
In one exemplary embodiment, the find by content software 106 and/or the find by metadata software 110 is used to find a term present in the file data 104 or meta data 108. For example, the software 106/110 may be used to find text and other information from word processing or text processing files created by word processing programs such as Microsoft Word, etc.
The find by content software 106 and find by metadata software 110 are operatively coupled to databases which include one or more indexes 122. The indexes 122 represent at least a subset of the data files in a storage device, including file data 104 and meta data 108, and may include all of the data files in a particular storage device (or several storage devices), such as the main hard drive of a computer system. The one or more indexes 122 comprise an indexed representation of the content and/or metadata of each item stored on the data files 104/108, such as a text document, music, video, or other type of file. The find by content software 106 searches for a term in that content by searching through the one or more index files 122 to see if the particular term, e.g., a particular word, is present in items stored on data files 104 which have been indexed. The find by content software functionality is available through find by metadata software 110 which provides the advantage to the user that the user can search the indexes 122 for the content 104 within an item stored on the data files 104 as well as any metadata 108 that may have been generated for the item.
In one embodiment of the present invention, indexing software 102 is used to create and maintain the one or more indexes 122 that are operatively coupled to the find by content and metadata software applications 106/110. Among other functions, the indexing software 102 receives information obtained by scanning the file data 104 and meta data 108, and uses that information to generate a postings list 112 that identifies an item containing a particular term, or having metadata containing a particular term. As such, the postings list 112 is a type of inverted index that maps a term, such as a search term, to the items identified in the list. In a typical embodiment, the information obtained during the scan includes a unique identifier that uniquely identifies the item containing the particular term, or having metadata containing the term. For example, items such as a word processing or text processing file have unique identifiers, referred to as ITEMIDs. The ITEMIDs are used when generating the postings list 112 to identify those items that contain a particular term, such as the word “Apple.” ITEMIDs identifying other types of files, such as image files or music files, may also be posted to the postings list 112, in which case the ITEMID typically identifies items having metadata containing a particular term.
In one embodiment, the indexing software 102 accumulates postings lists 112 for one or more terms into one or more update sets 120 and, from time to time, flushes the update sets 120 into one or more index files 122. The postings lists 112 for one or more items may also be stored in a postings file 118. The indexing software 102 may employ one or more indexing tables 114 that comprise one or more term tables, including a two-level table that separates the more frequently occurring terms from the less frequently occurring terms. The tables 114 may also include a postings table that comprises one or more postings lists for the terms that are being indexed. In one embodiment, the indexing software may maintain a live index 116 to contain the most current index. In some cases, updates to an index may be generated in a delta postings list 126, which is a specially marked postings list that may be dynamically applied to an index 122, postings files 118, update sets 120, or other forms of an index in order to ensure that the most current information is returned whenever those indexes are accessed.
In a typical embodiment, the previous postings entries are stored in a postings table, such as the illustrated postings table A 304. As illustrated, the postings table 304 comprises a storage space having a series of slots 306, some of which are occupied by the postings entries for the terms in the LEVEL 1 term table 302. Each immediate postings entry 306/308 contains a pointer to the previous posting entry for that term, i.e., the previous posted item. As such, the entries in the postings table A 304 form a series of interleaved linked lists. Storing the immediate posting entry in the term table 302 enables the postings table 304 to be updated more efficiently, by simply copying the immediate posting entry 306/308 from the term table 302 to the postings table 304 as needed, and changing the pointers accordingly. It should be noted that the LEVEL 2 term table 204 typically does not include such an immediate posting entry, since it would often be unused. However, in some embodiments the LEVEL 2 table may include such an entry.
In a typical embodiment, the postings table 304 is stored in a postings file 118 in a storage medium. Due to their volatile nature and large size, writing and reading the postings tables 304 to and from the storage medium can consume large amounts of processor time and memory. Therefore, a number of measures may be employed to optimize the processing of the postings tables 304. For example, in a typical embodiment, the previous posting entries are referenced from the new postings entries. As each immediate postings entry 306/308 is moved into the postings table 304, the pointer contained in the entry continues to point to the previous posting entry for that term. This ensures that the previous postings entries are not overwritten.
Another measure to optimize processing includes writing the postings table 304 in reverse order, i.e., updating the available slots 306 in the postings table 304 first from the end of the table until reaching the beginning of the table. For example, as illustrated in
As shown in
As illustrated in
A second algorithm that may be employed would be to simply move an entry into the LEVEL 1 table whenever it is referenced. A third algorithm tracks the number of occurrences in the currently processed item, and moves a term into the LEVEL 1 table when referenced if it has occurred more often in the currently processed item than the term that is currently occupying the contested slot in the LEVEL 1 table. This has the advantage of allowing the composition of terms maintained in the LEVEL 1 table to quickly adapt to changes in language from one item to the next, while limiting thrashing.
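The third algorithm might be sketched as follows. The fixed-size slot array, the CRC-based slot selection, and the names `Level1Table`, `reference`, and `process_item` are assumptions made for illustration; the description does not specify how a contested slot is chosen.

```python
import zlib
from collections import Counter


class Level1Table:
    """Fixed number of slots; a term contends for the slot it hashes to."""

    def __init__(self, n_slots):
        self.slots = [None] * n_slots

    def _slot(self, term):
        # Deterministic slot choice (an illustrative assumption).
        return zlib.crc32(term.encode("utf-8")) % len(self.slots)

    def reference(self, term, item_counts):
        """Promote `term` if it has occurred more often in the current item
        than the slot's occupant. Returns True if `term` is in LEVEL 1."""
        i = self._slot(term)
        occupant = self.slots[i]
        if occupant == term:
            return True
        if occupant is None or item_counts[term] > item_counts[occupant]:
            self.slots[i] = term
            return True
        return False


def process_item(table, text):
    """Count occurrences within one item and reference each term in turn."""
    counts = Counter()
    for term in text.lower().split():
        counts[term] += 1
        table.reference(term, counts)


table = Level1Table(n_slots=1)   # one slot, so every term contends for it
process_item(table, "a b b a a")
after_first_item = list(table.slots)
process_item(table, "c")         # new item: the old occupant's count is zero
after_second_item = list(table.slots)
```

A term must strictly exceed the occupant's count to displace it, which limits thrashing within an item, while the per-item counts let a new item's vocabulary take over quickly.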
Regardless of which algorithm is employed, processing continues at block 706, in which the indexing software 102 copies the currently posted ITEMID from the term table, typically the LEVEL 1 table, into the next available slot in the postings table. This prevents updates to the postings table from unnecessarily taking up space in the processor cache, and allows the operating system to page out the data when the system is under memory pressure.
At block 708, the indexing software 102 updates the pointer in the term table, typically the LEVEL 1 table, to reference the slot into which the currently posted ITEMID was copied. At block 710, the indexing software 102 is then ready to post the new ITEMID to the appropriate term table. Because the LEVEL 1 table is the more active table, and the LEVEL 2 table the less active table, the LEVEL 2 table may be optimized for searching rather than updating without significantly slowing down processing.
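The sequence of blocks 706, 708 and 710 can be sketched as below. For brevity this sketch appends to the postings table in forward order and keeps a single term table, and the names `TermTable`, `post`, and `item_ids` are hypothetical; it illustrates the steps rather than the described implementation.

```python
class TermTable:
    """Term table holding the most recent ITEMID for each term, plus a
    pointer to the slot of the previous posting in the postings table."""

    def __init__(self):
        self.entries = {}    # term -> [current_item_id, previous_slot]
        self.postings = []   # postings table: list of (item_id, previous_slot)

    def post(self, term, item_id):
        entry = self.entries.get(term)
        if entry is None:
            # First posting for this term: nothing to move yet.
            self.entries[term] = [item_id, None]
            return
        current_id, previous_slot = entry
        # Block 706: copy the currently posted ITEMID into the next free slot.
        self.postings.append((current_id, previous_slot))
        # Block 708: point the term table entry at that slot.
        entry[1] = len(self.postings) - 1
        # Block 710: post the new ITEMID to the term table.
        entry[0] = item_id

    def item_ids(self, term):
        """Follow the linked list for a term, newest ITEMID first."""
        entry = self.entries.get(term)
        if entry is None:
            return []
        ids = [entry[0]]
        slot = entry[1]
        while slot is not None:
            item_id, slot = self.postings[slot]
            ids.append(item_id)
        return ids


table = TermTable()
table.post("apple", 1)
table.post("pie", 1)
table.post("apple", 2)
table.post("apple", 3)
```

Entries already written to the postings table are never modified; only the term table entry changes on each post.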
In
In
As shown in
It will be apparent from this description that aspects of the present invention may be embodied, at least in part, in software. That is, the techniques may be carried out in a computer system or other data processing system in response to its processor, such as a microprocessor, executing sequences of instructions contained in a memory, such as ROM 1007, RAM 1005, mass storage 1006 or a remote storage device. In various embodiments, hardwired circuitry may be used in combination with software instructions to implement the present invention. Thus, the techniques are not limited to any specific combination of hardware circuitry and software nor to any particular source for the instructions executed by the data processing system. In addition, throughout this description, various functions and operations are described as being performed by or caused by software code to simplify description. However, those skilled in the art will recognize what is meant by such expressions is that the functions result from execution of the code by a processor, such as the microprocessor 1003.