This disclosure relates generally to data storage, and, more specifically, to manipulating a skip list data structure.
In the computer science field, various complex data structures have been developed to facilitate the storage of information. These data structures are often created using multiple pointers to join a collection of records together. When designing a complex structure, a developer often weighs concerns related to the complexities of inserting and retrieving information as well as the overall data structure size. A skip list is one example of a more complex data structure, which can be popular as it can maintain large data sets while still offering up to O(log n) insertion complexity and up to O(log n) search complexity. In this type of data structure, records may be sorted based on key order and associated using a linked hierarchy of data record sequences, with each successive sequence skipping over fewer elements than the previous sequence. This linked hierarchy is implemented using varying heights of pointer towers such that, within a given tower, pointers may be arranged based on the numbers of skipped-over records. This ability to skip over records when the skip list is traversed may allow a given record to be located more quickly than scanning through the records sequentially.
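The linked hierarchy described above can be illustrated with a minimal sketch. This is an instructive toy, not the implementation described in this disclosure; the maximum height, the probability of 1/2 per level, and the class names are illustrative choices.

```python
import random

class Node:
    """A skip-list record: a key plus a 'tower' of forward pointers."""
    def __init__(self, key, height):
        self.key = key
        self.forward = [None] * height  # forward[i] skips more records as i grows

class SkipList:
    MAX_HEIGHT = 16

    def __init__(self):
        # Sentinel head node with a full-height tower.
        self.head = Node(None, self.MAX_HEIGHT)

    def _random_height(self):
        # Each additional level appears with probability 1/2.
        h = 1
        while h < self.MAX_HEIGHT and random.random() < 0.5:
            h += 1
        return h

    def insert(self, key):
        # Record, at each level, the last tower whose pointer stays below `key`.
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.forward[level] and node.forward[level].key < key:
                node = node.forward[level]
            update[level] = node
        new = Node(key, self._random_height())
        for level in range(len(new.forward)):
            new.forward[level] = update[level].forward[level]
            update[level].forward[level] = new

    def search(self, key):
        # Start at the top of the sentinel tower; drop a level on overshoot.
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.forward[level] and node.forward[level].key < key:
                node = node.forward[level]
        node = node.forward[0]
        return node if node and node.key == key else None
```

Because each level skips roughly twice as many records as the one below it, both insertion and search perform a logarithmic number of comparisons in expectation.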
In some instances, skip lists may be used to maintain large quantities of information that is frequently manipulated. For example, as will be described below in further detail, a database system may use a buffer data structure to store data of active database transactions until the database transactions can be committed and their data flushed to a persistent storage of the database system. The buffer data structure may include a skip list data structure that enables efficient storage and lookup of transaction records in key order. As this database system may process a high volume of transactions in parallel, efficient scanning of the skip list can be important for database performance.
The present disclosure describes embodiments in which a more efficient scanning algorithm is employed to scan for records in a skip list and/or insert new records in the skip list. As will be described below in various embodiments, a skip list can be divided into sections based on prefixes of the keys such that a given section includes keys having the same prefix. When a scan for a particular key is performed, the prefix of the key is initially used to determine the relevant section for that prefix. A scan can then be initiated within that section and performed over only a subset of the skip list. This stands in contrast to other approaches in which a scan is initiated at the initial node (i.e., the sentinel node) and performed over the entire skip list. As a given section is considerably smaller than the entire skip list, far fewer memory accesses are performed. This can result in significant performance improvements—particularly in the exemplary database system discussed below, which, in one embodiment, may maintain almost 200,000,000 key-value records in the skip list at a given time.
The present disclosure begins with a discussion of a database system in conjunction with
Turning now to
Transaction manager 104, in one embodiment, includes program instructions that are executable to process received database transactions 102. In general, transactions 102 may be issued to read or write data to database 108 and may be received from any of various sources such as one or more client devices, application servers, software executing on database system 10, etc. As will be described in greater detail below, this processing may entail manager 104 initially storing records 112 for key-value pairs of transactions 102 in buffer data structure 106 until the records 112 can be flushed to the persistent storage of database 108. Accordingly, various functionality described below with respect to buffer data structure 106 may be implemented by transaction manager 104 such as adding key-value records 112 to record chains 110, facilitating acquisition of hash-bucket latches 126 for transactions 102, modifications to active transaction list 130 and skip list 140, etc.
Buffer data structure 106, in one embodiment, is a data structure that buffers key-value pairs for active transactions until the transactions commit. As will be described below, buffer data structure 106 is structured in a manner that allows for quick insertion of key-value pairs, which can be performed concurrently in some instances allowing for high volumes of transactions to be processed efficiently. Still further, buffer data structure 106 may reside in a local memory allowing for faster reads and writes than the persistent storage of database 108 where the data resides long term. In various embodiments, buffer data structure 106 allows concurrent modifications to be performed to it for different transactions 102, but provides a concurrency control mechanism via hash-bucket latches 126 for data within buffer data structure 106. In some embodiments, committed transaction data is asynchronously flushed from buffer data structure 106 to the persistent storage of database 108. That is, rather than perform a flush for each transaction 102's data upon its commitment, a flush is performed periodically for multiple committed transactions 102. For example, in one embodiment, transaction manager 104 initiates a flush to database 108 in response to buffer data structure 106 satisfying a particular size threshold.
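The batched, size-triggered flush described above can be sketched as follows. The threshold value, class names, and the synchronous trigger are simplifying assumptions; this disclosure describes the flush as asynchronous.

```python
class BufferSketch:
    """Illustrative buffer that flushes committed records in batches
    once a size threshold is satisfied (threshold value is hypothetical)."""
    def __init__(self, flush_threshold=4):
        self.records = []          # committed key-value pairs awaiting flush
        self.flushed_batches = []  # stands in for persistent storage
        self.flush_threshold = flush_threshold

    def commit(self, key, value):
        self.records.append((key, value))
        # The real system flushes asynchronously; this sketch flushes inline.
        if len(self.records) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One flush covers all committed records, in ascending key order.
        self.flushed_batches.append(sorted(self.records))
        self.records.clear()
```

A single flush thereby amortizes its cost over many committed transactions rather than paying it per transaction.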
Database 108 may correspond to any suitable form of database implementation. In some embodiments, database 108 is a non-relational database that is implemented using a log-structured merge (LSM) tree for persistent storage. In some embodiments, layers of the LSM tree may be distributed across multiple physical computer systems providing persistent storage. In some embodiments, these computer systems are cluster nodes of a computer cluster that provides a cloud-based system accessible to multiple clients. In some embodiments, database 108 may be part of a software as a service (SaaS) model; in other embodiments, database 108 may be directly operated by a user.
As noted above, when transaction manager 104 stores a key-value pair for an active transaction 102 in buffer data structure 106, a corresponding key-value record 112 may be created that includes the value and the key. If multiple transactions 102 attempt to write values associated with the same key, key-value records 112 may be generated for each value and linked together to form a record chain 110 corresponding to the key. For example, if a user has withdrawn a first amount from a bank account resulting in a first database transaction 102 and then a second amount resulting in a second database transaction 102, a record chain 110 corresponding to an account-balance key may have two key-value records 112 reflecting those withdrawals. In various embodiments, each record 112 includes a transaction identifier (e.g., a transaction sequence number) specifying its associated transaction 102; records 112 may also be organized in a record chain 110 based on the ordering in which the transactions 102 are received. For example, as described below with respect to
Hash table 120, in one embodiment, is a data structure that allows constant-time lookups of record chains 110 based on a given key. That is, when a key is received, hash table 120 is indexed into by applying hash function 122 to the key to produce the appropriate index value for the hash bucket 124 corresponding to the key. The direct pointer in the hash bucket 124 may then be referenced to obtain the record chain 110. Being able to perform constant-time lookups may significantly reduce the time consumed to read key-value records 112, write records 112, or perform key probes (i.e., determining whether a key has a key-value record 112 present in buffer data structure 106).
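The combination of a hash table of buckets and head-inserted record chains can be sketched as follows. The class names, bucket count, and use of Python's built-in `hash` are illustrative assumptions, not the implementation of hash function 122.

```python
class KeyValueRecord:
    """One version of a key's value; `nxt` links to the next (older) record."""
    def __init__(self, key, value, txn_id, nxt=None):
        self.key, self.value, self.txn_id = key, value, txn_id
        self.nxt = nxt

class RecordHashTable:
    """Constant-time lookup of record chains; newest record at the head."""
    def __init__(self, nbuckets=64):
        self.buckets = [None] * nbuckets

    def _index(self, key):
        # Stand-in for the hash function applied to the key.
        return hash(key) % len(self.buckets)

    def write(self, key, value, txn_id):
        i = self._index(key)
        # Insert at the head so the chain is ordered newest-to-oldest.
        self.buckets[i] = KeyValueRecord(key, value, txn_id, self.buckets[i])

    def read(self, key):
        rec = self.buckets[self._index(key)]
        while rec is not None:
            if rec.key == key:  # skip colliding keys sharing the bucket
                return rec
            rec = rec.nxt
        return None
```

A read thus costs one hash computation plus a walk down a (typically short) chain, independent of the total number of keys stored.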
As noted above, in various embodiments, each hash bucket 124 includes a respective latch 126 that controls access to its record chain 110. Accordingly, when a transaction is attempting to read or write a value associated with a particular key, the key may be used to index into hash table 120 and acquire the latch 126 corresponding to the key's associated hash bucket 124 before reading or writing is performed. If a latch 126 cannot be acquired for a database transaction 102, processing the database transaction 102 may be delayed until the latch 126 is released. In some embodiments, latches 126 may have one of three possible states: available, shared acquired, and exclusively acquired. If no transaction 102 is currently accessing a record chain 110, its latch 126 is available for acquiring. If a transaction 102 is performing a read of a key-value record 112, the latch 126 may be acquired in a shared state—meaning that other transactions 102 can also acquire the latch 126 as long as they are also performing a read (i.e., not attempting to modify a record 112 while it is also being read). If a transaction 102 is performing a write, however, the latch 126 is acquired for the transaction 102 in an exclusive state—meaning no other transaction 102 may acquire the latch 126 until it is released. Accordingly, if two transactions 102 are attempting to perform writes for the same key, the later transaction is delayed until the former completes its write operation and releases the latch 126. If a transaction 102 is attempting to access multiple key-value pairs, latches 126 may be acquired in ascending order of buckets 124 to prevent deadlock. Although acquisition of latches 126 may be discussed primarily with respect to read and write operations, latches 126 may also be acquired when performing other operations such as defragmentation, garbage collection, flushing records 112 to the persistent store of database 108, etc. 
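The three latch states described above can be sketched with a simple state machine. This single-threaded sketch illustrates only the acquisition rules; a real latch would use atomic operations, and the method names are hypothetical.

```python
class Latch:
    """Sketch of a latch with three states: available, shared, exclusive."""
    def __init__(self):
        self.readers = 0        # count of shared holders
        self.exclusive = False  # True when exclusively acquired

    def try_acquire_shared(self):
        # Readers may share the latch unless a writer holds it.
        if self.exclusive:
            return False
        self.readers += 1
        return True

    def try_acquire_exclusive(self):
        # A writer needs the latch to be fully available.
        if self.exclusive or self.readers > 0:
            return False
        self.exclusive = True
        return True

    def release_shared(self):
        self.readers -= 1

    def release_exclusive(self):
        self.exclusive = False
```

Acquiring latches in ascending bucket order, as described above, prevents deadlock because no two transactions can each hold a latch the other is waiting for.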
In some embodiments, latches 126 may also serve as a concurrency control mechanism for active transaction list 130 and skip list 140.
Active transaction list 130, in one embodiment, is a data structure that tracks various metadata for active transactions 102. In various embodiments, the metadata for a given transaction 102 includes a transaction identifier for the transaction 102 and one or more pointers usable to access records 112 associated with the transaction 102. In doing so, list 130 enables a transaction 102's records 112 to be identified based on its transaction identifier, which may be helpful when, for example, determining which records 112 should be removed if the transaction 102 is being rolled back. The metadata may also include an indication of whether a transaction is active or committed, which may be used to determine if its records 112 can be marked for flushing to database 108.
Skip list 140, in one embodiment, is a data structure that maintains an ordering of keys in records 112 to allow forward and reverse scanning of keys. In some embodiments, database 108 may be configured such that records 112 for committed transactions 102 are flushed in ascending key order (as well as version order); skip list 140 may allow this ordering to be quickly and easily determined. As will be described in greater detail below with respect to
As records 112 are inserted into buffer data structure 106, a scan of skip list 140 may be performed to determine where to insert the records in skip list 140. Once records 112 have been inserted, additional skip list scans may be performed to locate particular records 112 in skip list 140. In the illustrated embodiment, scan engine 150 is a component of transaction manager 104 that is executable to scan skip list 140 and may implement the fast scan algorithm discussed in more detail below starting with
The contents of records 112, including those used to implement skip list 140, will now be discussed in greater detail in order to facilitate better understanding of the fast scan algorithm discussed in detail later.
Turning now to
In the illustrated embodiment, record chain 110 is implemented using a linked list such that each key-value record 112 includes a pointer 219 identifying the next record 112 in the chain 110. When a record 112 is added, it is inserted at the head identified by the direct pointer 202 in the hash bucket 124 or appended to a collision record 220 discussed below. The added record 112 may then include a pointer 219 to the record that was previously at the head. As the record 112 becomes older, it migrates toward the tail (record 112B or lock record 230 in
Once a key-value record 112 has been successfully flushed to persistent storage, in some embodiments, transaction manager 104 sets a purge flag 216 to indicate that the record 112 is ready for purging from buffer data structure 106. In some embodiments, a purge engine may then read this flag 216 in order to determine whether the record 112 should be purged from buffer data structure 106.
In some embodiments, collision records 220 are used to append records 112 to chain 110 when two different keys (e.g., keys 212A and 212C) produce the same hash value (i.e., a hash collision occurs) and thus share the same hash bucket 124. In various embodiments, the size of hash table 120 is selected to have a sufficient number of hash buckets 124 in order to ensure a low likelihood of collision. If a hash collision occurs, however, a record 220 may be inserted including pointers 222 to records 112 having different keys 212. Although, in many instances, a hash-bucket latch 126 is specific to a single respective key 212, in such an event the hash-bucket latch 126 would be associated with multiple, different keys 212.
As noted above, in some embodiments, individual records 112 may also include their own respective locks 217 to provide additional coherency control. In some embodiments, a separate lock record 230 may also be inserted into record chains 110 to create a lock tied to a particular key when there is no corresponding value.
Skip list pointers 218, in one embodiment, are the pointers that form skip list 140. As will be discussed next with
Turning now to
When a particular key 212 is being searched in skip list 140, traversal of skip list 140 may begin, in the illustrated embodiment, at the top of the left most tower 300 (the location corresponding to bucket ID 312A1 in the illustrated embodiment), where the key 212 in record 112 is compared against the key being searched. If there is a match, the record 112 being searched for has been located. If not, traversal proceeds along the path of forward pointer 314A to another record 112 having another key 212, which is compared. If that key 212 is less than key 212 being searched for, traversal returns to the previous tower 300 and drops down to the next level in the tower 300 (the location of bucket ID 312A2 in
Although forward pointers 314 are depicted in
Before discussing the fast scan algorithm, it is instructive to consider how a less efficient scan algorithm is implemented.
Turning now to
In the example depicted in
As can be seen, this process continues for another twenty memory accesses until record R is identified as having a pointer 218 of bucket #17 to record—not including the additional memory accesses for using indirect pointers or the multiple accesses to move down a record chain 110. Furthermore, slow scan 400 may be performed multiple times to insert multiple records 112 associated with a given transaction. Moreover, in some embodiments, skip list 140 may include much taller skip list towers 300 (e.g., ones having 33 levels) and be substantially wider. All these memory accesses can affect system performance. In many instances, the fast scan algorithm discussed next uses far fewer memory accesses.
Turning now to
In the illustrated embodiment, fast scan 500 begins in step 510 with scan engine 150 using a prefix of a particular key to identify an anchor range within skip list 140 including the relevant location for the particular key. As noted above, an “anchor range” is a set/range of records in a skip list that have the same key prefix and is used to anchor where a skip list scan is initiated. As will be discussed below with
In step 520, this scan begins with scan engine 150 “climbing the mountain” from an anchor record to a record 112 having a skip list tower 300 that overshoots the relevant location. As will be discussed below with
In step 530, once an overshooting tower 300 has been identified, scan engine 150 performs a local skip list traversal from the overshooting tower 300 to identify the relevant location for the particular key. In various embodiments, this local skip list traversal may be implemented in a manner similar to slow scan 400, but without starting at a sentinel tower. If scan 500 is being performed to access a record 112, scan 500 may conclude with scan engine 150 accessing the contents of the record 112 once the traversal identifies its location. If a record 112 is being inserted at the location, scan 500 may proceed to step 540 to determine how skip list 140 should be modified to facilitate the insertion.
In step 540, scan engine 150 walks backwards from the identified location in the skip list 140 to identify skip-list pointers 314 relevant to the record insertion. As will be discussed with
In many instances, fast scan 500 results in far fewer memory accesses than slow scan 400 as a skip list traversal is only performed on a small subset of the total records 112 in skip list 140. Although some additional memory accesses may be performed to facilitate this local skip list traversal, employing techniques such as climbing the mountain and walking backwards for relevant pointer identification ensures that the number of additional memory accesses, in many instances, falls well below those needed to perform a slow scan 400.
Various examples of dividing skip list 140's key space using key prefixes will now be discussed.
Turning now to
In the example depicted in
Although partitioning 600A uses the same key prefix lengths across the entire key space, in some embodiments, different key prefix lengths may be used for different portions of the key space as will be discussed next.
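The prefix-based partitioning described above can be illustrated with a short sketch. The plan contents, prefix lengths, and function names are hypothetical; the point is that a longer prefix yields narrower anchor ranges for denser regions of the key space.

```python
def anchor_range_of(key, prefix_len):
    """Keys sharing the same prefix fall into the same anchor range."""
    return key[:prefix_len]

def prefix_len_for(key, anchor_plan):
    """Select a prefix length based on which region of the key space
    the key falls in (regions listed longest-match first)."""
    for region_prefix, length in anchor_plan:
        if key.startswith(region_prefix):
            return length
    return 1  # default prefix length for sparse regions

# Hypothetical anchor plan: keys under "AB" are dense, so they use a
# longer (length-3) prefix and thus narrower anchor ranges.
plan = [("AB", 3), ("A", 2)]
```

Two keys map to the same anchor range exactly when they agree on the selected prefix, so scans for nearby keys start from the same anchor record.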
Turning now to
In various embodiments, differing anchor plans 650 may be selected based on the underlying distribution of keys 212 in skip list 140. For example, a portion of a key space having a greater concentration of keys 212 may use additional prefixes 610. In some embodiments in which database system 10 implements a multi-tenant database, scan engine 150 may use different anchor plans 650 for different tenants. Still further, prefixes 610 may be selected to group records 112 having a common attribute (e.g., being included in the same database table) into the same anchor range 620. In some embodiments, scan engine 150 may also alter anchor plans 650 over time in order to improve performance. For example, in one embodiment, scan engine 150 may use a machine learning algorithm to determine appropriate anchor plans 650 for various divisions of the key space. Such alterations, however, are unlikely to be disruptive for database system 10, as scan engine 150 can initially fall back to using slow scan 400 for inserting records 112 until a sufficient number of records 112 have been inserted under a new anchor plan 650, making fast scan 500 viable again. Because this temporary performance loss likely affects only a small number of scans, altering anchor plans 650 is a relatively inexpensive operation.
Turning now to
In the illustrated embodiment, scan engine 150 implements anchor record identification 700 by applying hash function 122 to a given prefix 610 to determine the bucket identifier 312 for the relevant hash bucket 124 (e.g., bucket 124N). If a corresponding anchor entry 710 exists in the bucket 124 (meaning that a corresponding anchor range 620 exists for that prefix 610), the included bucket identifier 312 for the anchor record 622 is read to determine the hash bucket 124 relevant to the anchor record 622. This bucket identifier 312 is then used to access anchor record 622's hash bucket 124 (e.g., bucket 124A). Scan engine 150 then traverses the pointer in the bucket 124 to the anchor chain 110 including the anchor record 622. If, however, no anchor entry 710 exists in the hash bucket 124 (meaning that a corresponding anchor range 620 does not yet exist), scan engine 150 may proceed to use the next largest key prefix 610 for the particular key 212 and continually repeat this process until a relevant anchor record 622 can be identified—or proceed to perform a slow scan 400 if no anchor entry 710 can be identified for any key prefix 610 of the key 212.
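The fallback behavior described above can be sketched as follows, with a plain dictionary standing in for the hash table and its anchor entries; the function name and the prefix-length list are hypothetical.

```python
def find_anchor(key, anchor_index, prefix_lengths):
    """Try prefixes from longest (narrowest range) to shortest; return
    the anchor record for the first prefix with an entry, or None to
    signal that the caller should fall back to a full (slow) scan.
    `anchor_index` maps a key prefix to its anchor record."""
    for length in sorted(prefix_lengths, reverse=True):
        anchor = anchor_index.get(key[:length])
        if anchor is not None:
            return anchor
    return None  # no anchor range exists yet for any prefix of this key
```

Preferring the longest prefix first means the scan starts in the narrowest available anchor range, minimizing the portion of the skip list that must be traversed.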
In various embodiments, scan engine 150 creates an anchor entry 710 for an anchor range 620 when an initial record 112 gets added to the range 620—the initial record 112 becoming the anchor record 622. As records 112 with lower keys 212 get added to the range 620 (or records 112 with lower keys 212 get purged), scan engine 150 updates anchor record bucket identifier 312 in the anchor entry 710 to point to hash buckets 124 of these new anchor records 622 (or the records 112 with the next lowest keys 212). If each record 112 for a particular anchor range 620 is purged, scan engine 150 may remove the corresponding anchor entry 710 from its hash bucket 124. In some instances, multiple anchor entries 710 may be modified for a given key 212 if its corresponding record 112 becomes the anchor record 622 for multiple prefixes 610.
In some rare instances, two prefixes 610 may map to the same hash bucket 124 due to a hash collision. Thus, scan engine 150 may compare the prefix 610 being hashed to the key prefix 610 in an anchor entry 710 in order to confirm that the anchor entry 710 is the correct one for the prefix 610 being hashed. In some embodiments, a given hash bucket 124 may include multiple anchor entries 710 if two prefixes 610 are associated with the same hash bucket 124. In another embodiment, an anchor entry 710 for a colliding prefix 610 may be stored in a collision record 220 pointed to by the hash bucket 124. In other embodiments, other techniques may be used to handle collisions such as proceeding to use the next longest prefix 610 for a particular key 212 being scanned.
Turning now to
Turning now to
In various embodiments, this technique begins with scan engine 150 “climbing up the mountain”—meaning that the scan engine 150 scans forward from the anchor record 622 along the highest available pointers 314 in skip list 140's towers 300 until an overshooting tower 300 (i.e., a tower 300 with a pointer 314 that points past/overshoots the location 624) can be identified. Accordingly, in the example depicted in
Once an overshooting tower 300 has been identified, scan engine 150 may “climb down the mountain” using a local skip-list traversal 810. In various embodiments, local traversal 810 is implemented in the same manner as the skip-list traversal in slow scan 400; however, traversal 810 is a “local” performance—meaning that it is initiated from the overshooting tower 300, not the sentinel tower 300. Accordingly, in response to determining that pointer 314 “42” points past location 624 in the depicted example, scan engine 150 may traverse the next highest pointer 314 “15,” traverse pointer 314 “9,” and then, after overshooting, determine location 624. As a local traversal 810 also affords logarithmic scanning, the process of climbing up and climbing down the mountain is an efficient approach for determining a relevant location 624 once a relevant anchor range 620 has been identified.
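The climb-up/climb-down sequence can be sketched as follows, assuming in-memory nodes with direct pointers (the indirect bucket-identifier pointers described above are omitted for clarity); the deterministic list builder is a test helper, not part of the technique.

```python
class Node:
    def __init__(self, key, height):
        self.key = key
        self.forward = [None] * height  # the node's pointer tower

def build_list(keys_heights):
    """Deterministically link nodes into a skip list (helper for the sketch)."""
    nodes = [Node(k, h) for k, h in keys_heights]
    for i, n in enumerate(nodes):
        for level in range(len(n.forward)):
            # Point to the next node tall enough to have this level.
            for m in nodes[i + 1:]:
                if len(m.forward) > level:
                    n.forward[level] = m
                    break
    return nodes

def climb_and_descend(anchor, key):
    """Climb: from `anchor`, follow each tower's highest pointer forward
    until the next top pointer overshoots `key`. Descend: run a normal
    skip-list traversal starting from that overshooting tower rather
    than from the sentinel. Returns the last node with node.key < key."""
    node = anchor
    while True:
        top = node.forward[-1]  # highest pointer in this node's tower
        if top is None or top.key >= key:
            break               # this tower overshoots: stop climbing
        node = top
    for level in range(len(node.forward) - 1, -1, -1):
        while node.forward[level] and node.forward[level].key < key:
            node = node.forward[level]
    return node
```

Both phases skip geometrically many records per step, so the combined walk stays logarithmic in the size of the anchor range rather than the whole list.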
If scan 500 is being performed to locate an existing record 112, scan 500 may conclude with scan engine 150 providing the record 112's contents. If, however, scan 500 is being performed to insert a record 112, scan engine 150 may need to perform additional work to determine what pointers 314 should be included in the record 112 and what pointers 314 should be altered in other records 112 as discussed next.
Turning now to
In the illustrated embodiment, scan engine 150 implements walking backwards 900 by sequentially scanning backwards along a lowest level in skip list 140 to identify pointers 314 in other records 112 that are relevant to the key-value record 112 being inserted into skip list 140. In various embodiments, scan engine 150 may initially determine the height of the new tower 300 for the record 112 being inserted in order to know how many levels will be in the new tower 300—and thus know the number of relevant pointers 314 for insertion into the tower 300. In some embodiments, scan engine 150 determines the tower 300's height using a power-of-2 backoff—meaning that skip list tower heights are assigned logarithmically with ½ being one tall (not including the lowest, reverse level), ¼ being two tall, and so on. In one embodiment, scan engine 150 may determine a tower 300's height by generating a random number and counting the number of consecutive 1s (or 0s) from the most significant bit (or least significant bit). Once the tower height is determined, scan engine 150 may scan sequentially backwards until engine 150 has identified an overshooting pointer 314 for each level in the new tower 300. Accordingly, in the example depicted
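The power-of-2 backoff height assignment can be sketched as follows; the 64-bit draw, the counting direction, and the maximum height of 33 (mentioned earlier as an example) are illustrative choices.

```python
import random

def tower_height(max_height=33, rng=random):
    """Count consecutive 1 bits of a random number, so roughly half of
    all towers are 1 tall, a quarter are 2 tall, and so on."""
    bits = rng.getrandbits(64)
    height = 1
    while height < max_height and (bits & 1):
        bits >>= 1
        height += 1
    return height
```

The expected height is constant (about 2), which keeps towers short on average while still producing the occasional tall tower needed for long skips.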
In some instances, scan engine 150 may not perform walking backwards 900 if a record 112 being inserted has a tower 300 that is tall enough (e.g., in some embodiments, exceeding eight levels, which is a rare occurrence) to make walking backwards 900 a less efficient approach than performing a slow scan 400 to determine the set of relevant pointers 314. Accordingly, in some embodiments, prior to scanning sequentially backwards, scan engine 150 may determine a number of pointers 314 being included in the new record 112's tower 300. Based on the determined number, scan engine 150 determines whether to traverse skip list 140 using a slow scan 400 from a sentinel node or use walking backwards 900 to scan sequentially backwards from the identified location 624.
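When walking backwards is used, the pointer-collection step can be sketched as follows. The `back` pointer and flat node layout are simplified stand-ins for the record structures described above, and the function name is hypothetical.

```python
class Node:
    """Minimal record stand-in: a tower of forward pointers plus a
    reverse pointer along the lowest level."""
    def __init__(self, key, height):
        self.key = key
        self.forward = [None] * height
        self.back = None

def walk_backwards(location, new_height):
    """From the insertion location, scan backwards along the lowest
    (reverse) level, recording the nearest preceding tower for each
    level of the new tower; those towers hold the pointers to splice."""
    preds = [None] * new_height
    found = 0
    node = location
    while node is not None and found < new_height:
        for level in range(len(node.forward)):
            if level < new_height and preds[level] is None:
                preds[level] = node
                found += 1
        node = node.back  # follow the reverse pointer
    return preds
```

Because towers taller than the new tower are encountered quickly when towers are distributed logarithmically, this backward walk typically visits only a handful of records.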
Once all the relevant pointers 314 have been identified, scan engine 150 may perform a record insertion as discussed next.
Turning now to
Because record chains 110 and/or records 112 have the potential to be modified during walking backwards 900 and record insertion 950, scan engine 150 may also acquire latches 126 and/or locks 217 for records 112 determined to have relevant pointers 314 to prevent modifications during walking 900 and insertion 950. Once each of the relevant pointers 314 has been identified, scan engine 150 may further perform a verification of the pointers 314 before performing the record insertion 950 in order to ensure that a concurrent record modification or another record insertion has not interfered with record insertion 950.
As noted above, in various embodiments, scan engine 150 may also need to add or update one or more anchor entries 710 in response to the inserted record 112 becoming an initial record 112 for an anchor range 620 or having the new lowest key 212 for one or more anchor ranges 620.
Various methods that use one or more of the techniques discussed above will now be discussed.
Turning now to
In step 1015, a skip list (e.g., skip list 140) is stored including a plurality of key-value records (e.g., key-value records 112) that include one or more pointers (e.g., skip list pointers 218/314) to others of the plurality of key-value records. In some embodiments, the skip list maintains an ordering of keys for key-value records stored in a buffer data structure that stores data for active database transactions.
In step 1020, the skip list is scanned for a location (e.g., location 624) associated with a particular key.
In sub-step 1022, the scanning includes using a prefix (e.g., key prefix 610) of the particular key to identify a particular portion (e.g., anchor range 620) of the skip list where the particular portion includes key-value records having keys with the same prefix. In various embodiments, an index is maintained that associates key prefixes with key-value records in respective portions having the same key prefixes, and using the prefix includes accessing the index to identify an initial key-value record (e.g., anchor record 622) in the particular portion from where to initiate the scan. In some embodiments, accessing the index includes applying a hash function (e.g., hash function 122) to the prefix to identify a hash bucket (e.g., hash bucket 124) including a pointer (e.g., bucket identifier 312) to the initial key-value record in the particular portion and traversing (e.g., local traversal 810) the pointer to the initial key-value record in the particular portion. In some embodiments, a first prefix (e.g., shorter key prefix 610A) of the particular key and a second, longer prefix (e.g., longer key prefix 610B) are selected to determine whether relevant portions (e.g., wider and narrower anchor ranges 620A and 620B) of the skip list exist for the first and second prefixes and, in response to determining that a relevant portion does exist in the skip list for the second, longer prefix, selecting the portion relevant to the second prefix (e.g., narrower anchor range 620B) for use in the scan for the location. In some embodiments, a length of the prefix to be used is determined by accessing a set of anchor plans (e.g., anchor plans 650) that specify prefix lengths to be used based on a portion of the particular key.
In sub-step 1024, the scanning includes initiating a scan for the location within the identified portion. In various embodiments, the scan includes scanning forward (e.g., climbing the mountain 800) through the key-value records in the particular portion along the highest available pointers in skip list towers of the key-value records until a key-value record is identified that includes a pointer (e.g., in an overshooting tower 300 in
In some embodiments, method 1010 further includes inserting a key-value record (e.g., inserted record 112 in
Turning now to
In step 1035, a skip list (e.g., skip list 140) is stored that maintains an ordering of keys (e.g., keys 212) for key-value records (e.g., records 112) of a database (e.g., database 108). In various embodiments, the skip list maintains the ordering of keys for key-value records of database transactions awaiting commitment by the database.
In step 1040, a location (e.g., location 624) is determined within the skip list associated with a particular key. In sub-step 1042, the determining includes using a prefix (e.g., key prefix 610) of the particular key to identify a range (e.g., anchor range 620) within the skip list, where the range includes key-value records having keys with the same prefix. In sub-step 1044, the determining includes initiating a scan for the location within the identified range. In some embodiments, the determining includes scanning forward through the identified range until a key-value record is identified that includes a skip list tower (e.g., overshooting tower 300 in
Turning now to
In step 1065, a skip list (e.g., skip list 140) is maintained that preserves an ordering of keys (e.g., keys 212) for a plurality of key-value records (e.g., records 112). In some embodiments, a first of the plurality of key-value records in the skip list indirectly points to a second of the plurality of key-value records by including a first pointer (e.g., bucket identifier 312) to a bucket in a hash table (e.g., hash table 120), where the bucket includes a second pointer to the second key-value record.
In step 1070, the skip list is divided (e.g., key space partitioning 600) into sections (e.g., anchor ranges 620) based on prefixes (e.g., prefixes 610) of the keys. In some embodiments, the dividing includes maintaining a hash table (e.g., hash table 120) that associates the prefixes with pointers (e.g., anchor record bucket identifiers 312) to the sections within the skip list.
In step 1075, a location (e.g., location 624) associated with a particular key is determined, by initiating a scan for the location within a section corresponding to a prefix of the particular key. In various embodiments, the determining includes scanning (e.g., climbing up in
In some embodiments, method 1060 further includes scanning backwards (e.g., walking backwards 900) along a lowest level in the skip list to identify pointers to insert in a skip list tower (e.g., new tower 300 in
Turning now to
MTS 1100, in various embodiments, is a set of computer systems that together provide various services to users (alternatively referred to as “tenants”) that interact with MTS 1100. In some embodiments, MTS 1100 implements a customer relationship management (CRM) system that provides mechanisms for tenants (e.g., companies, government bodies, etc.) to manage their relationships and interactions with customers and potential customers. For example, MTS 1100 might enable tenants to store customer contact information (e.g., a customer's website, email address, telephone number, and social media data), identify sales opportunities, record service issues, and manage marketing campaigns. Furthermore, MTS 1100 may enable those tenants to identify how customers have been communicated with, what the customers have bought, when the customers last purchased items, and what the customers paid. To provide the services of a CRM system and/or other services, as shown, MTS 1100 includes a database platform 1110 and an application platform 1120.
Database platform 1110, in various embodiments, is a combination of hardware elements and software routines that implement database services for storing and managing data of MTS 1100, including tenant data. As shown, database platform 1110 includes data storage 1112. Data storage 1112, in various embodiments, includes a set of storage devices (e.g., solid state drives, hard disk drives, etc.) that are connected together on a network (e.g., a storage attached network (SAN)) and configured to redundantly store data to prevent data loss. In various embodiments, data storage 1112 is used to implement a database 108 comprising a collection of information that is organized in a way that allows for access, storage, and manipulation of the information. Data storage 1112 may implement a single database, a distributed database, a collection of distributed databases, a database with redundant online or offline backups or other redundancies, etc. As part of implementing the database, data storage 1112 may store one or more database records 112 having respective data payloads (e.g., values for fields of a database table) and metadata (e.g., a key value, timestamp, table identifier of the table associated with the record, tenant identifier of the tenant associated with the record, etc.).
In various embodiments, a database record 112 may correspond to a row of a table. A table generally contains one or more data categories that are logically arranged as columns or fields in a viewable schema. Accordingly, each record of a table may contain an instance of data for each category defined by the fields. For example, a database may include a table that describes a customer with fields for basic contact information such as name, address, phone number, fax number, etc. A record therefore for that table may include a value for each of the fields (e.g., a name for the name field) in the table. Another table might describe a purchase order, including fields for information such as customer, product, sale price, date, etc. In various embodiments, standard entity tables are provided for use by all tenants, such as tables for account, contact, lead and opportunity data, each containing pre-defined fields. MTS 1100 may store, in the same table, database records for one or more tenants—that is, tenants may share a table. Accordingly, database records, in various embodiments, include a tenant identifier that indicates the owner of a database record. As a result, the data of one tenant is kept secure and separate from that of other tenants so that that one tenant does not have access to another tenant's data, unless such data is expressly shared.
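As an illustration of the tenant-scoped records described above, the following sketch filters a shared table by tenant identifier. The field names and helper are hypothetical, not part of the disclosure:

```python
# Hypothetical record layout for a shared purchase-order table.
purchase_orders = [
    {"tenant_id": "t1", "customer": "Acme", "product": "Widget", "sale_price": 10.0},
    {"tenant_id": "t2", "customer": "Globex", "product": "Gadget", "sale_price": 25.0},
]


def rows_for_tenant(table, tenant_id):
    """Return only the rows owned by `tenant_id`, keeping each tenant's
    data separate from that of other tenants."""
    return [row for row in table if row["tenant_id"] == tenant_id]
```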
In some embodiments, the data stored at data storage 1112 includes buffer data structure 106 and a persistent storage organized as part of a log-structured merge-tree (LSM tree). As noted above, a database server 1114 may initially write database records into a local in-memory buffer data structure 106 before later flushing those records to the persistent storage (e.g., in data storage 1112). As part of flushing database records, the database server 1114 may write the database records 112 into new files that are included in a “top” level of the LSM tree. Over time, the database records may be rewritten by database servers 1114 into new files included in lower levels as the database records are moved down the levels of the LSM tree. In various implementations, as database records age and are moved down the LSM tree, they are moved to slower and slower storage devices (e.g., from a solid state drive to a hard disk drive) of data storage 1112.
When a database server 1114 wishes to access a database record for a particular key, the database server 1114 may traverse the different levels of the LSM tree for files that potentially include a database record for that particular key 212. If the database server 1114 determines that a file may include a relevant database record, the database server 1114 may fetch the file from data storage 1112 into a memory of the database server 1114. The database server 1114 may then check the fetched file for a database record 112 having the particular key 212. In various embodiments, database records 112 are immutable once written to data storage 1112. Accordingly, if the database server 1114 wishes to modify the value of a row of a table (which may be identified from the accessed database record), the database server 1114 writes out a new database record 112 into buffer data structure 106, which is purged to the top level of the LSM tree. Over time, that database record 112 is merged down the levels of the LSM tree. Accordingly, the LSM tree may store various database records 112 for a database key 212 where the older database records 112 for that key 212 are located in lower levels of the LSM tree than newer database records.
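The read and write path just described can be sketched as follows, modeling each LSM level as an in-memory map purely for illustration (a real implementation fetches and scans files, but the newest-level-first ordering logic is the same):

```python
def lookup(levels, key):
    """Search an LSM tree modeled as a list of levels, newest (top) first.

    Because newer records sit in higher levels, the first match found is
    the most recent record for the key.
    """
    for level in levels:
        record = level.get(key)
        if record is not None:
            return record
    return None


def flush(buffer, levels):
    """Flush the in-memory buffer as a new top level of the tree.

    Records are immutable once written, so an update is a new record in
    the buffer rather than a modification of an existing level.
    """
    levels.insert(0, dict(buffer))
    buffer.clear()
```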
Database servers 1114, in various embodiments, are hardware elements, software routines, or a combination thereof capable of providing database services, such as data storage, data retrieval, and/or data manipulation. Such database services may be provided by database servers 1114 to components (e.g., application servers 1122) within MTS 1100 and to components external to MTS 1100. As an example, a database server 1114 may receive a database transaction request from an application server 1122 that is requesting data to be written to or read from data storage 1112. The database transaction request may specify an SQL SELECT command to select one or more rows from one or more database tables. The contents of a row may be defined in a database record and thus database server 1114 may locate and return one or more database records that correspond to the selected one or more table rows. In various cases, the database transaction request may instruct database server 1114 to write one or more database records for the LSM tree—database servers 1114 maintain the LSM tree implemented on database platform 1110. In some embodiments, database servers 1114 implement a relational database management system (RDBMS) or object-oriented database management system (OODBMS) that facilitates storage and retrieval of information against data storage 1112. In various cases, database servers 1114 may communicate with each other to facilitate the processing of transactions. For example, database server 1114A may communicate with database server 1114N to determine if database server 1114N has written a database record into its in-memory buffer for a particular key.
Application platform 1120, in various embodiments, is a combination of hardware elements and software routines that implement and execute CRM software applications as well as provide related data, code, forms, web pages and other information to and from user systems 1150 and store related data, objects, web page content, and other tenant information via database platform 1110. In order to facilitate these services, in various embodiments, application platform 1120 communicates with database platform 1110 to store, access, and manipulate data. In some instances, application platform 1120 may communicate with database platform 1110 via different network connections. For example, one application server 1122 may be coupled via a local area network and another application server 1122 may be coupled via a direct network link. Transmission Control Protocol and Internet Protocol (TCP/IP) are exemplary protocols for communicating between application platform 1120 and database platform 1110; however, it will be apparent to those skilled in the art that other transport protocols may be used depending on the network interconnect used.
Application servers 1122, in various embodiments, are hardware elements, software routines, or a combination thereof capable of providing services of application platform 1120, including processing requests received from tenants of MTS 1100. Application servers 1122, in various embodiments, can spawn environments 1124 that are usable for various purposes, such as providing functionality for developers to develop, execute, and manage applications. Data may be transferred into an environment 1124 from another environment 1124 and/or from database platform 1110. In some cases, environments 1124 cannot access data from other environments 1124 unless such data is expressly shared. In some embodiments, multiple environments 1124 can be associated with a single tenant.
Application platform 1120 may provide user systems 1150 access to multiple, different hosted (standard and/or custom) applications, including a CRM application and/or applications developed by tenants. In various embodiments, application platform 1120 may manage creation of the applications, testing of the applications, storage of the applications into database objects at data storage 1112, execution of the applications in an environment 1124 (e.g., a virtual machine of a process space), or any combination thereof. In some embodiments, application platform 1120 may add and remove application servers 1122 from a server pool at any time for any reason; accordingly, there may be no server affinity for a user and/or organization to a specific application server 1122. In some embodiments, an interface system (not shown) implementing a load balancing function (e.g., an F5 Big-IP load balancer) is located between the application servers 1122 and the user systems 1150 and is configured to distribute requests to the application servers 1122. In some embodiments, the load balancer uses a least connections algorithm to route user requests to the application servers 1122. Other load balancing algorithms, such as round robin and observed response time, can also be used. For example, in certain embodiments, three consecutive requests from the same user could hit three different servers 1122, and three requests from different users could hit the same server 1122.
In some embodiments, MTS 1100 provides security mechanisms, such as encryption, to keep each tenant's data separate unless the data is shared. If more than one server 1114 or 1122 is used, they may be located in close proximity to one another (e.g., in a server farm located in a single building or campus), or they may be distributed at locations remote from one another (e.g., one or more servers 1114 located in city A and one or more servers 1122 located in city B). Accordingly, MTS 1100 may include one or more logically and/or physically connected servers distributed locally or across one or more geographic locations.
One or more users (e.g., via user systems 1150) may interact with MTS 1100 via network 1140. User system 1150 may correspond to, for example, a tenant of MTS 1100, a provider (e.g., an administrator) of MTS 1100, or a third party. Each user system 1150 may be a desktop personal computer, workstation, laptop, PDA, cell phone, or any Wireless Access Protocol (WAP) enabled device or any other computing device capable of interfacing directly or indirectly to the Internet or other network connection. User system 1150 may include dedicated hardware configured to interface with MTS 1100 over network 1140. User system 1150 may execute a graphical user interface (GUI) corresponding to MTS 1100, an HTTP client (e.g., a browsing program, such as Microsoft's Internet Explorer™ browser, Netscape's Navigator™ browser, Opera's browser, or a WAP-enabled browser in the case of a cell phone, PDA or other wireless device, or the like), or both, allowing a user (e.g., subscriber of a CRM system) of user system 1150 to access, process, and view information and pages available to it from MTS 1100 over network 1140. Each user system 1150 may include one or more user interface devices, such as a keyboard, a mouse, touch screen, pen or the like, for interacting with a graphical user interface (GUI) provided by the browser on a display monitor screen, LCD display, etc. in conjunction with pages, forms and other information provided by MTS 1100 or other systems or servers. As discussed above, disclosed embodiments are suitable for use with the Internet, which refers to a specific global internetwork of networks. It should be understood, however, that other networks may be used instead of the Internet, such as an intranet, an extranet, a virtual private network (VPN), a non-TCP/IP based network, any LAN or WAN or the like.
Because the users of user systems 1150 may be users in differing capacities, the capacity of a particular user system 1150 might be determined by one or more permission levels associated with the current user. For example, when a salesperson is using a particular user system 1150 to interact with MTS 1100, that user system 1150 may have capacities (e.g., user privileges) allotted to that salesperson. But when an administrator is using the same user system 1150 to interact with MTS 1100, the user system 1150 may have capacities (e.g., administrative privileges) allotted to that administrator. In systems with a hierarchical role model, users at one permission level may have access to applications, data, and database information accessible by a lower permission level user, but may not have access to certain applications, database information, and data accessible by a user at a higher permission level. Thus, different users may have different capabilities with regard to accessing and modifying application and database information, depending on a user's security or permission level. There may also be some data structures managed by MTS 1100 that are allocated at the tenant level while other data structures are managed at the user level.
In some embodiments, a user system 1150 and its components are configurable using applications, such as a browser, that include computer code executable on one or more processing elements. Similarly, in some embodiments, MTS 1100 (and additional instances of MTSs, where more than one is present) and their components are operator configurable using application(s) that include computer code executable on processing elements. Thus, various operations described herein may be performed by executing program instructions stored on a non-transitory computer-readable medium and executed by processing elements. The program instructions may be stored on a non-volatile medium such as a hard disk, or may be stored in any other volatile or non-volatile memory medium or device as is well known, such as a ROM or RAM, or provided on any media capable of storing program code, such as a compact disk (CD) medium, digital versatile disk (DVD) medium, a floppy disk, and the like. Additionally, the entire program code, or portions thereof, may be transmitted and downloaded from a software source, e.g., over the Internet, or from another server, as is well known, or transmitted over any other conventional network connection as is well known (e.g., extranet, VPN, LAN, etc.) using any communication medium and protocols (e.g., TCP/IP, HTTP, HTTPS, Ethernet, etc.) as are well known. It will also be appreciated that computer code for implementing aspects of the disclosed embodiments can be implemented in any programming language that can be executed on a server or server system such as, for example, in C, C++, HTML, Java, JavaScript, or any other scripting language, such as VBScript.
Network 1140 may be a LAN (local area network), WAN (wide area network), wireless network, point-to-point network, star network, token ring network, hub network, or any other appropriate configuration. The global internetwork of networks, often referred to as the “Internet” with a capital “I,” is one example of a TCP/IP (Transmission Control Protocol and Internet Protocol) network. It should be understood, however, that the disclosed embodiments may utilize any of various other types of networks.
User systems 1150 may communicate with MTS 1100 using TCP/IP and, at a higher network level, use other common Internet protocols to communicate, such as HTTP, FTP, AFS, WAP, etc. For example, where HTTP is used, user system 1150 might include an HTTP client commonly referred to as a “browser” for sending and receiving HTTP messages from an HTTP server at MTS 1100. Such a server might be implemented as the sole network interface between MTS 1100 and network 1140, but other techniques might be used as well or instead. In some implementations, the interface between MTS 1100 and network 1140 includes load sharing functionality, such as round-robin HTTP request distributors to balance loads and distribute incoming HTTP requests evenly over a plurality of servers.
In various embodiments, user systems 1150 communicate with application servers 1122 to request and update system-level and tenant-level data from MTS 1100 that may require one or more queries to data storage 1112. In some embodiments, MTS 1100 automatically generates one or more SQL statements (the SQL query) designed to access the desired information. In some cases, user systems 1150 may generate requests having a specific format corresponding to at least a portion of MTS 1100. As an example, user systems 1150 may request to move data objects into a particular environment 1124 using an object notation that describes an object relationship mapping (e.g., a JavaScript object notation mapping) of the specified plurality of objects.
Turning now to
Processor subsystem 1280 may include one or more processors or processing units. In various embodiments of computer system 1200, multiple instances of processor subsystem 1280 may be coupled to interconnect 1260. In various embodiments, processor subsystem 1280 (or each processor unit within 1280) may contain a cache or other form of on-board memory.
System memory 1220 is usable to store program instructions executable by processor subsystem 1280 to cause system 1200 to perform various operations described herein. System memory 1220 may be implemented using different physical, non-transitory memory media, such as hard disk storage, floppy disk storage, removable disk storage, flash memory, random access memory (RAM—SRAM, EDO RAM, SDRAM, DDR SDRAM, RAMBUS RAM, etc.), read only memory (PROM, EEPROM, etc.), and so on. Memory in computer system 1200 is not limited to primary storage such as memory 1220. Rather, computer system 1200 may also include other forms of storage such as cache memory in processor subsystem 1280 and secondary storage on I/O Devices 1250 (e.g., a hard drive, storage array, etc.). In some embodiments, these other forms of storage may also store program instructions executable by processor subsystem 1280 to cause system 1200 to perform operations described herein. In some embodiments, memory 1220 may include transaction manager 104, scan engine 150, buffer data structure 106, and/or portions of database 108.
I/O interfaces 1240 may be any of various types of interfaces configured to couple to and communicate with other devices, according to various embodiments. In one embodiment, I/O interface 1240 is a bridge chip (e.g., Southbridge) from a front-side to one or more back-side buses. I/O interfaces 1240 may be coupled to one or more I/O devices 1250 via one or more corresponding buses or other interfaces. Examples of I/O devices 1250 include storage devices (hard drive, optical drive, removable flash drive, storage array, SAN, or their associated controller), network interface devices (e.g., to a local or wide-area network), or other devices (e.g., graphics, user interface devices, etc.). In one embodiment, computer system 1200 is coupled to a network via a network interface device 1250 (e.g., configured to communicate over WiFi, Bluetooth, Ethernet, etc.).
Although specific embodiments have been described above, these embodiments are not intended to limit the scope of the present disclosure, even where only a single embodiment is described with respect to a particular feature. Examples of features provided in the disclosure are intended to be illustrative rather than restrictive unless stated otherwise. The above description is intended to cover such alternatives, modifications, and equivalents as would be apparent to a person skilled in the art having the benefit of this disclosure.
The scope of the present disclosure includes any feature or combination of features disclosed herein (either explicitly or implicitly), or any generalization thereof, whether or not it mitigates any or all of the problems addressed herein. Accordingly, new claims may be formulated during prosecution of this application (or an application claiming priority thereto) to any such combination of features. In particular, with reference to the appended claims, features from dependent claims may be combined with those of the independent claims and features from respective independent claims may be combined in any appropriate manner and not merely in the specific combinations enumerated in the appended claims.
The present disclosure includes references to “an embodiment” or groups of “embodiments” (e.g., “some embodiments” or “various embodiments”). Embodiments are different implementations or instances of the disclosed concepts. References to “an embodiment,” “one embodiment,” “a particular embodiment,” and the like do not necessarily refer to the same embodiment. A large number of possible embodiments are contemplated, including those specifically disclosed, as well as modifications or alternatives that fall within the spirit or scope of the disclosure.
This disclosure may discuss potential advantages that may arise from the disclosed embodiments. Not all implementations of these embodiments will necessarily manifest any or all of the potential advantages. Whether an advantage is realized for a particular implementation depends on many factors, some of which are outside the scope of this disclosure. In fact, there are a number of reasons why an implementation that falls within the scope of the claims might not exhibit some or all of any disclosed advantages. For example, a particular implementation might include other circuitry outside the scope of the disclosure that, in conjunction with one of the disclosed embodiments, negates or diminishes one or more of the disclosed advantages. Furthermore, suboptimal design execution of a particular implementation (e.g., implementation techniques or tools) could also negate or diminish disclosed advantages. Even assuming a skilled implementation, realization of advantages may still depend upon other factors such as the environmental circumstances in which the implementation is deployed. For example, inputs supplied to a particular implementation may prevent one or more problems addressed in this disclosure from arising on a particular occasion, with the result that the benefit of its solution may not be realized. Given the existence of possible factors external to this disclosure, it is expressly intended that any potential advantages described herein are not to be construed as claim limitations that must be met to demonstrate infringement. Rather, identification of such potential advantages is intended to illustrate the type(s) of improvement available to designers having the benefit of this disclosure.
That such advantages are described permissively (e.g., stating that a particular advantage “may arise”) is not intended to convey doubt about whether such advantages can in fact be realized, but rather to recognize the technical reality that realization of such advantages often depends on additional factors.
Unless stated otherwise, embodiments are non-limiting. That is, the disclosed embodiments are not intended to limit the scope of claims that are drafted based on this disclosure, even where only a single example is described with respect to a particular feature. The disclosed embodiments are intended to be illustrative rather than restrictive, absent any statements in the disclosure to the contrary. The application is thus intended to permit claims covering disclosed embodiments, as well as such alternatives, modifications, and equivalents that would be apparent to a person skilled in the art having the benefit of this disclosure.
For example, features in this application may be combined in any suitable manner. Accordingly, new claims may be formulated during prosecution of this application (or an application claiming priority thereto) to any such combination of features. In particular, with reference to the appended claims, features from dependent claims may be combined with those of other dependent claims where appropriate, including claims that depend from other independent claims. Similarly, features from respective independent claims may be combined where appropriate.
Accordingly, while the appended dependent claims may be drafted such that each depends on a single other claim, additional dependencies are also contemplated. Any combinations of features in the dependent claims that are consistent with this disclosure are contemplated and may be claimed in this or another application. In short, combinations are not limited to those specifically enumerated in the appended claims.
Where appropriate, it is also contemplated that claims drafted in one format or statutory type (e.g., apparatus) are intended to support corresponding claims of another format or statutory type (e.g., method).
Because this disclosure is a legal document, various terms and phrases may be subject to administrative and judicial interpretation. Public notice is hereby given that the following paragraphs, as well as definitions provided throughout the disclosure, are to be used in determining how to interpret claims that are drafted based on this disclosure.
References to a singular form of an item (i.e., a noun or noun phrase preceded by “a,” “an,” or “the”) are, unless context clearly dictates otherwise, intended to mean “one or more.” Reference to “an item” in a claim thus does not, without accompanying context, preclude additional instances of the item. A “plurality” of items refers to a set of two or more of the items.
The word “may” is used herein in a permissive sense (i.e., having the potential to, being able to) and not in a mandatory sense (i.e., must).
The terms “comprising” and “including,” and forms thereof, are open-ended and mean “including, but not limited to.”
When the term “or” is used in this disclosure with respect to a list of options, it will generally be understood to be used in the inclusive sense unless the context provides otherwise. Thus, a recitation of “x or y” is equivalent to “x or y, or both,” and thus covers 1) x but not y, 2) y but not x, and 3) both x and y. On the other hand, a phrase such as “either x or y, but not both” makes clear that “or” is being used in the exclusive sense.
A recitation of “w, x, y, or z, or any combination thereof” or “at least one of . . . w, x, y, and z” is intended to cover all possibilities involving a single element up to the total number of elements in the set. For example, given the set [w, x, y, z], these phrasings cover any single element of the set (e.g., w but not x, y, or z), any two elements (e.g., w and x, but not y or z), any three elements (e.g., w, x, and y, but not z), and all four elements. The phrase “at least one of . . . w, x, y, and z” thus refers to at least one element of the set [w, x, y, z], thereby covering all possible combinations in this list of elements. This phrase is not to be interpreted to require that there is at least one instance of w, at least one instance of x, at least one instance of y, and at least one instance of z.
Various “labels” may precede nouns or noun phrases in this disclosure. Unless context provides otherwise, different labels used for a feature (e.g., “first circuit,” “second circuit,” “particular circuit,” “given circuit,” etc.) refer to different instances of the feature. Additionally, the labels “first,” “second,” and “third” when applied to a feature do not imply any type of ordering (e.g., spatial, temporal, logical, etc.), unless stated otherwise.
The phrase “based on” is used to describe one or more factors that affect a determination. This term does not foreclose the possibility that additional factors may affect the determination. That is, a determination may be solely based on specified factors or based on the specified factors as well as other, unspecified factors. Consider the phrase “determine A based on B.” This phrase specifies that B is a factor that is used to determine A or that affects the determination of A. This phrase does not foreclose that the determination of A may also be based on some other factor, such as C. This phrase is also intended to cover an embodiment in which A is determined based solely on B. As used herein, the phrase “based on” is synonymous with the phrase “based at least in part on.”
The phrases “in response to” and “responsive to” describe one or more factors that trigger an effect. This phrase does not foreclose the possibility that additional factors may affect or otherwise trigger the effect, either jointly with the specified factors or independent from the specified factors. That is, an effect may be solely in response to those factors, or may be in response to the specified factors as well as other, unspecified factors. Consider the phrase “perform A in response to B.” This phrase specifies that B is a factor that triggers the performance of A, or that triggers a particular result for A. This phrase does not foreclose that performing A may also be in response to some other factor, such as C. This phrase also does not foreclose that performing A may be jointly in response to B and C. This phrase is also intended to cover an embodiment in which A is performed solely in response to B. As used herein, the phrase “responsive to” is synonymous with the phrase “responsive at least in part to.” Similarly, the phrase “in response to” is synonymous with the phrase “at least in part in response to.”
Within this disclosure, different entities (which may variously be referred to as “units,” “circuits,” other components, etc.) may be described or claimed as “configured” to perform one or more tasks or operations. This formulation—[entity] configured to [perform one or more tasks]—is used herein to refer to structure (i.e., something physical). More specifically, this formulation is used to indicate that this structure is arranged to perform the one or more tasks during operation. A structure can be said to be “configured to” perform some task even if the structure is not currently being operated. Thus, an entity described or recited as being “configured to” perform some task refers to something physical, such as a device, circuit, a system having a processor unit and a memory storing program instructions executable to implement the task, etc. This phrase is not used herein to refer to something intangible.
In some cases, various units/circuits/components may be described herein as performing a set of tasks or operations. It is understood that those entities are “configured to” perform those tasks/operations, even if not specifically noted.
The term “configured to” is not intended to mean “configurable to.” An unprogrammed FPGA, for example, would not be considered to be “configured to” perform a particular function. This unprogrammed FPGA may be “configurable to” perform that function, however. After appropriate programming, the FPGA may then be said to be “configured to” perform the particular function.
For purposes of United States patent applications based on this disclosure, reciting in a claim that a structure is “configured to” perform one or more tasks is expressly intended not to invoke 35 U.S.C. § 112(f) for that claim element. Should Applicant wish to invoke Section 112(f) during prosecution of a United States patent application based on this disclosure, it will recite claim elements using the “means for” [performing a function] construct.
The present application claims priority to U.S. Prov. App. No. 63/267,089, entitled “Fast Skip-List Scanning,” filed Jan. 24, 2022, which is incorporated by reference herein in its entirety.