There is a continual need for data storage and retrieval capabilities. Data storage and retrieval systems that allow for real-time data retrieval and low-latency read operations, that also allow for utilization and visualization of stored data, are desired.
Aspects of the current subject matter relate to spatial and temporal data storage and retrieval.
A method, in accordance with an implementation of the current subject matter, includes receiving, by a processing device associated with a storage structure, a storage request including samples from a user device in communication with the processing device. The processing device stores, in a database table of the storage structure, a transaction that includes a transaction structure representing the storage request and including a list of the samples. The processing device processes the transaction to enable subsequent user read operations. The processing includes dividing the samples into one or more tiles according to criteria based upon spatial and/or temporal factors of the samples and the one or more tiles; in response to a determination that, for a given one of the one or more tiles, a number of samples exceeds a predefined threshold, compositing the number of samples in the given one of the one or more tiles; and storing, in the storage structure, the composited number of samples.
The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims. The claims that follow this disclosure are intended to define the scope of the protected subject matter.
The accompanying drawings, which are incorporated in and constitute a part of this specification, show certain aspects of the subject matter disclosed herein and, together with the description, help explain some of the principles associated with the disclosed implementations. In the drawings:
When practical, similar reference numbers denote similar structures, features, or elements.
Aspects of the current subject matter relate to a data storage and retrieval system that employs a data storage structure in which user data is spatially and temporally stored and fetched in real-time, allowing users to utilize and visualize the data. The data storage structure, in accordance with implementations of the current subject matter, provides for a structured query and retrieval mechanism and is optimized for low-latency read operations covering a temporal and spatial range, while also allowing for varying play rates and spatial zoom levels. The data storage structure incorporates data stores for query retrieval and allows for easily changing and updating search and storage algorithms.
With reference to
In accordance with implementations of the current subject matter described herein, the devices 110a,b,c,d provide data to the storage and retrieval processor 120. The storage and retrieval processor 120 stores the received data and subsequently retrieves the data for utilization and/or presentation to a corresponding one of the devices 110a,b,c,d. For example, a particular device 110a provides the storage and retrieval processor 120 with data, and later, upon the device 110a requesting the data or submitting a query (e.g., through one or more request instructions or signals, which may be user-driven, provided to the storage and retrieval processor 120 from the device 110a), the storage and retrieval processor 120 retrieves the data for utilization and/or presentation to the device 110a (e.g., to a user of the device 110a). Additionally, rather than a particular user requesting the data or initiating a query, the storage and retrieval processor 120 may compile reports, presentations, or the like relating to the data for transmission to the device 110a, which may be stored in the device 110a for later viewing, manipulation, and/or processing by the device 110a.
Storage 140 is associated with the storage and retrieval processor 120. The storage 140 may be separate and remote from the storage and retrieval processor 120 or may be integrated within the storage and retrieval processor 120. Alternatively, portions of the storage 140 may be integrated with the storage and retrieval processor 120 while other portions are remotely located. The storage 140 may be part of the cloud. The storage 140 includes multiple database tables in a defined storage structure to store the data provided by the devices 110a,b,c,d, consistent with implementations of the current subject matter as described herein. The defined storage structure of the storage 140 includes but is not limited to a tile store, an entity store, and a metadata store, consistent with some implementations of the current subject matter described herein.
In accordance with some implementations of the current subject matter described herein, an entity may be defined as a person, place, or thing that exists in space and time. An entity may be represented as a series of samples, each of which defines the properties of that entity at a single instant in time. A sample may be represented by, for example, a JavaScript Object Notation (JSON) structure containing an ID, the sample time, the sample location, and any other properties that describe the entity at that instant in time. For example, a sample representing a real-time status of a package shipment may be represented as follows:
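By way of illustration only, such a sample could be sketched as a JSON-serializable structure; the field names and values below are hypothetical and not prescribed by this disclosure:

```python
import json

# Hypothetical sample for a package shipment; field names and values
# are illustrative only.
sample = {
    "id": "package-1234",                         # entity identifier
    "time": "2023-05-01T12:00:00Z",               # sample time (ISO 8601)
    "location": {"lon": -122.41, "lat": 37.77},   # sample location
    "status": "in_transit",                       # additional entity property
}

# A sample round-trips through a compact JSON encoding for storage.
encoded = json.dumps(sample, sort_keys=True)
decoded = json.loads(encoded)
```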
In accordance with implementations of the current subject matter described herein, a dataset is a named collection of samples for one or more entities. In certain implementations, data storage and retrieval by the storage and retrieval processor 120 is indexed by dataset.
Implementations of the current subject matter provide a storage structure that achieves low-latency responses to various types of queries and/or requests. Sample data is stored in different formats in multiple database tables. The different representations of data are, in accordance with some implementations, completely transparent to the end user (e.g., the end user has access to each of the different representations of data). In accordance with other implementations, the different representations of data may be semi-transparent to the end user (e.g., the end user is able to see portions of the different representations of data) or nontransparent to the end user.
The transaction store holds unmodified user requests (“transactions”). Storage requests may begin by the request, received by the storage and retrieval processor 120, being directly stored into the transaction store. Each request is represented by a transaction structure, which may contain, for example, an operation type, a list of samples, and/or a valid flag. Operation types may define what the user is intending to do with the specified samples; for example, inserting new samples, modifying existing samples, and/or removing existing samples. The valid flag is, as an example, a Boolean value set to true when the transaction is determined to contain well-formatted data and can thus be processed into the other data stores. Invalid transactions (e.g., the Boolean value set to false) may be visible within the transaction store, but since their data cannot be processed, they are not represented within other data stores, according to some implementations.
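The transaction structure described above could be sketched as follows; the type and field names here are assumptions for the example, not a required implementation:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class OperationType(Enum):
    """What the user intends to do with the specified samples."""
    INSERT = "insert"   # insert new samples
    MODIFY = "modify"   # modify existing samples
    REMOVE = "remove"   # remove existing samples

@dataclass
class Transaction:
    """An unmodified user storage request as held in the transaction store."""
    operation: OperationType
    samples: List[dict] = field(default_factory=list)
    valid: bool = True  # set to False when the data is not well-formatted

# An insert request with a single sample.
tx = Transaction(OperationType.INSERT, samples=[{"id": "e1", "time": 0.0}])
```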
A transaction is stored in a database table, along with a sequence number. Each sequence number/transaction value is stored in a separate row within the database table. Transactions may be stored in a compacted binary form in a single database column. Sequence numbers are represented with an integer column.
All transactions are, according to some variations of the current subject matter, applied in order. Transactions are generally unmodifiable: users may view the contents of the transaction store and create new transactions, but they may not alter an existing transaction.
After a transaction has been added to the transaction store, an asynchronous job begins to apply that transaction to the downstream stores, further described herein. The first step of the asynchronous job is, according to one aspect of the current subject matter, to validate the transaction. While an initial validation may be performed during the user request (resulting in the Boolean value of the valid flag in the transaction structure), a subsequent, more thorough validation can optionally be made. The initial validation is performed before responding to the user request and thus is preferably completed within a short time period. The subsequent validation stage is performed asynchronously and may require that users poll the status of the operation to determine when/if it has been completed. The subsequent validation process may be unique for each type of transaction operation, but it may typically involve checking the existing data to determine whether or not the selected or identified operation types are legal. For example, it is not legal to remove a sample that was never added in the first place.
After the second, subsequent validation of the transaction, the transaction is applied to a tile store, in accordance with implementations of the current subject matter. The tile store allows for efficient access to a range of data covering a spatial and temporal range. Within the tile store, samples are divided into tiles and indexed based upon the tile's location in space and time. The tile store is optimized for low-latency read operations covering a temporal and spatial range while also allowing for varying play rates and spatial zoom levels. The dividing of the samples into one or more tiles is in accordance with criteria relating to spatial and/or temporal factors of the samples and the tiles, as further described herein.
A tile is represented as an array of samples, stored in a compact binary format, and indexed by a tile index. A tile index is a set of integers specifying the tile's location. For example, a set of four integers may specify the tile's location in the x, y, z, and time dimensions. Additionally, two more integers may be used to specify the tile's spatial and temporal zoom levels. These integers may be combined to define an example of the index for the tile data.
In accordance with implementations of the current subject matter, the spatial and temporal zoom levels define the size of a tile. In some variations, spatial zoom level “0” is defined such that each tile is 360.0 units wide (the width of the earth in degrees longitude). Each zoom level above that yields a tile half as wide. For example, spatial zoom level “1” is 180.0 units wide, and spatial zoom level “2” is 90.0 units wide. Negative spatial zoom levels are also allowed. Spatial zoom level “−1” is 720.0 units wide. Temporal zoom level is defined similarly. In some variations, temporal zoom level “0” is defined to be one year in width. Temporal zoom level “1” is thus one half of a year, and temporal zoom level “−1” is two years. Positions in space and time can be converted into tile indices using the width of the tile. Tile indices can be negative. These particular spatial and temporal zoom level definitions defined herein are merely exemplary, and other spatial and temporal zoom level definitions may be applied. Selection of the base tile size is driven by the desired spatial and temporal ranges to be represented within the tile store, as well as the constraints of the computing environment's representation of numbers. For example, representing timespans spanning the life of the universe (billions of years) would yield significantly different tile sizes than representations for the life of a subatomic particle (tiny fractions of a second).
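The zoom-level and index computations described above can be sketched as follows, assuming the exemplary 360.0-unit base width:

```python
import math

BASE_SPATIAL_WIDTH = 360.0  # width of a tile at spatial zoom level 0 (degrees)

def tile_width(zoom_level, base_width=BASE_SPATIAL_WIDTH):
    """Each zoom level above 0 halves the tile width; negative levels double it."""
    return base_width / (2.0 ** zoom_level)

def tile_index(position, zoom_level, base_width=BASE_SPATIAL_WIDTH):
    """Convert a position into a (possibly negative) tile index at a zoom level."""
    return math.floor(position / tile_width(zoom_level, base_width))
```

The same computation applies to the temporal dimension with a base width of one year expressed in the system's time units.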
With reference to
If the determined zoom levels are not less than the current minimum levels (as determined at 215; e.g., the determined ideal minimum zoom level is not new), then the process skips 220 and 225 and continues at 230.
After ensuring tiles exist at the minimum zoom level, at 230, the samples from the new transaction are divided into separate tiles at the minimum zoom levels and each tile is updated. Updates may be accomplished by reading the existing tile (at 235) (e.g., resulting in tiles 240) and merging new samples at 245. The merging of new samples may include adding the new samples to the sample list, and writing the updated sample list back into the database.
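A minimal in-memory sketch of this read-merge-write update, with a dictionary standing in for the database table, might look like:

```python
# In-memory stand-in for the database table holding tiles; a real
# implementation would read and write database rows instead.
tiles = {}

def update_tile(index, new_samples):
    """Read the existing tile at `index` (an empty list if absent), merge
    the new samples into its sample list, and write the result back."""
    existing = tiles.get(index, [])
    tiles[index] = existing + list(new_samples)
    return tiles[index]
```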
When the number of samples in a single tile becomes too large, the latency of reading that tile becomes high as well. In order to provide users with the minimum possible read latency, in accordance with implementations of the current subject matter, two actions may be taken: (1) the samples within the tile are composited; and (2) the tile is subdivided into multiple tiles at a higher spatial and/or temporal zoom level.
“Too large” may be defined by the read latency of the underlying database, which directly correlates with the size of the represented tile data (e.g., in bytes). The number of samples contained within the tile can be used as an approximation to this size. A threshold value of 8,000 samples, for example, could be used to determine when a tile has become too large. Other threshold values can also be utilized.
Compositing algorithms, in accordance with implementations described herein, may be applied based on the quantity of samples in a given tile. When the number of samples in a tile exceeds a predefined number, compositing is performed to maintain performance, thus reducing the sample count. Any time a tile is composited, it is subdivided at a different zoom level (e.g., each dimension is divided in half as needed). The subdivided tiles are used on the read side: when a user requests data spanning a particular spatial range and temporal range/play rate, the correct choice of zoom level at which to look for data is computed (e.g., by determining which zoom levels exist in a dataset and correlating those to minimize the number of tiles read).
According to implementations of the current subject matter, compositing is the process of reducing the number of samples being written into a tile by combining some of the samples in an intelligent manner. Implementations of the current subject matter provide two forms of compositing: temporal compositing and spatial compositing.
In temporal compositing, samples are filtered so that the rate of updates to a given entity does not exceed a given threshold. This threshold is determined by the temporal zoom level. As the temporal zoom level decreases, the update frequency also decreases. As a result, data stored in low temporal levels is best suited for high play rates, in which users would not be able to perceive all of the individual updates to the entities.
In an example implementation, the threshold may be set to 30 Hz. That is, no entity will have more than 30 samples per second at the user's specified play rate. This prevents data from being queried at a rate higher than the rate at which the user can reasonably view it. For example, if a user queries for data spanning a 10 second time range at a play rate of twice real-time, the resulting data will be viewed in a span of five seconds. Thus no more than 30*5=150 samples should be returned for any entity. The temporal zoom level for the request is thus set to ensure that there are approximately 150 samples over the requested time range.
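The rate-limiting behavior of temporal compositing, together with the sample-budget arithmetic above, can be sketched as follows; the (entity_id, time) sample representation is an assumption of the example:

```python
def temporal_composite(samples, max_hz):
    """Filter samples so that no entity updates more often than max_hz.

    `samples` is a time-ordered list of (entity_id, time) pairs. A sample
    is kept only if at least 1/max_hz seconds separate it from the
    previously kept sample for the same entity.
    """
    min_spacing = 1.0 / max_hz
    last_kept = {}   # entity_id -> time of last kept sample
    kept = []
    for entity_id, t in samples:
        prev = last_kept.get(entity_id)
        if prev is None or t - prev >= min_spacing:
            last_kept[entity_id] = t
            kept.append((entity_id, t))
    return kept
```

At a 30 Hz threshold, samples closer together than about 33 ms for the same entity are dropped, matching the sample budget of 30 samples per viewed second described above.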
With reference to
In accordance with some implementations of the current subject matter, if temporal compositing does not reduce the number of samples enough, spatial compositing may also be performed. In spatial compositing, samples are combined based on location. The spacing of the combined samples is determined by the spatial zoom level. As the spatial zoom level decreases, the distance between compositing points increases. As a result, data stored in low spatial zoom levels is best suited for large spatial regions in which it is difficult for a user to distinguish many individual data points. Spatial compositing, in accordance with implementations of the current subject matter, allows for highly dense datasets when zoomed out and played at high rates.
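A minimal sketch of spatial compositing, assuming samples are (x, y) positions and keeping one representative sample per grid cell, might look like:

```python
import math

def spatial_composite(samples, grid_spacing):
    """Combine samples that fall within the same grid cell, keeping the
    first sample seen per cell. `grid_spacing` would be derived from the
    spatial zoom level; the (x, y) representation is assumed."""
    cells = {}
    for x, y in samples:
        cell = (math.floor(x / grid_spacing), math.floor(y / grid_spacing))
        cells.setdefault(cell, (x, y))  # keep the first sample in each cell
    return list(cells.values())
```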
With reference to
Referring again to
If it is determined that compositing is necessary to reduce the size of a single tile, that tile is subdivided in either or both of the spatial and temporal dimensions. If temporal compositing results in a reduction of the sample set, the original sample set (before compositing was applied) is inserted into a new tile at a temporal zoom level one higher than the current temporal zoom level. This results in up to two new tiles being created, each covering one half the time range of the original tile. These new tiles may be referred to herein as “children” of the original tile. The original tile may be referred to herein as the “parent” of the new tiles. The children tiles may be written following the same process that was used to create the parent tile. That is, the sample set will be divided into tiles, which may be composited and thus may lead to further subdivision. Since each child tile covers only one half the time range, the number of samples that needs to be written to each tile is on average half that of the parent tile. This may yield tiles in which no compositing is required, thus preventing further subdivision. If one of the tiles contains more entities than the other, further compositing may be required, leading to further subdivision. This process may continue until no further subdivision is required, no further subdivision is possible, or a maximum temporal zoom level has been reached.
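The temporal subdivision step, splitting a parent tile's samples into two children at the midpoint of its time range, can be sketched as follows (the (entity_id, time) pair representation is assumed):

```python
def subdivide_temporal(samples, t_start, t_end):
    """Split a tile's samples into two child tiles, each covering one half
    of the parent's time range [t_start, t_end). Returns the
    (left_child, right_child) sample lists."""
    midpoint = (t_start + t_end) / 2.0
    left = [s for s in samples if s[1] < midpoint]
    right = [s for s in samples if s[1] >= midpoint]
    return left, right
```

In a full implementation, each child would then be written via the same tile-update process as its parent, recursing if a child still exceeds the compositing threshold.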
As previously described, the tile store is optimized for low latency read operations. A user (e.g., one of the devices 110a,b,c,d or a user thereof) may provide a time range, a spatial range, a play rate, and/or a spatial zoom level as inputs. The tile store then operates, as described herein, to return a list of samples that cover at least the specified range. The correct set of tiles to read must first be computed to return the list of samples. With reference to
At 615, after the zoom levels have been determined, the set of tiles to read is computed. This includes the set of tiles at the ideal zoom levels that completely covers the user-requested temporal and spatial ranges.
Due to the approach taken, consistent with implementations described herein, to store data in tiles, some tiles may be subdivided into tiles at higher zoom levels while others may not have such a subdivision. As a result, the system must map the ideal tiles to read to the actual tiles in the dataset. This may be achieved by maintaining a tile index for each dataset. The tile index specifies which tiles are contained in the dataset. At 620, the tile index is read. At 625, each of the ideal tiles is mapped to its best corresponding tile, if any, using this index. After finding the list of best corresponding tiles, the list may be filtered to remove potential duplicates.
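One way to sketch the mapping from ideal tiles to actual tiles, assuming each tile index can be walked up to its parent, is:

```python
def map_ideal_tiles(ideal_tiles, dataset_index, parent_of):
    """Map each ideal tile to the best actual tile in the dataset: the
    tile itself if present, otherwise the nearest ancestor. `parent_of`
    returns a tile's parent index, or None at the root (an assumed
    helper for this sketch)."""
    actual = []
    for tile in ideal_tiles:
        candidate = tile
        while candidate is not None and candidate not in dataset_index:
            candidate = parent_of(candidate)
        if candidate is not None:
            actual.append(candidate)
    # Remove potential duplicates while preserving order.
    return list(dict.fromkeys(actual))
```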
At 630, the tiles are read from the resulting list. At 635, the results are combined into a single list of samples, which are returned to the user (640).
In
The series of examples in
In
When the tile 1000 from the previous image is subdivided, two new tiles 1005 and 1010, each covering one half (1050a,b) the total time range 1050, are created as shown in
When the tile 1005 from the previous image is subdivided, two new tiles 1005-1 and 1005-2, each covering one half (1050a-1,a-2) of the total time range 1050a, are created as shown in
Returning back to the overall processing of the transaction, and the asynchronous job applying the transaction to the downstream stores, after being processed in the tile store, the transaction is applied to an entity store, in accordance with implementations of the current subject matter. The entity store allows for efficient access to all samples of a specific entity. Samples are stored in a compressed binary form in a table indexed by the identifier and time fields. Each sample from a transaction is added as a new row in this table, replacing the prior sample if necessary.
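A minimal in-memory sketch of the entity store, with a dictionary keyed by (identifier, time) standing in for the indexed database table:

```python
# In-memory stand-in for the entity store table, indexed by (identifier, time).
entity_store = {}

def apply_sample(entity_id, time, sample):
    """Add a sample as a new row, replacing any prior sample with the
    same (identifier, time) key."""
    entity_store[(entity_id, time)] = sample

def samples_for_entity(entity_id):
    """Return all samples of a specific entity, ordered by time."""
    rows = [(t, s) for (eid, t), s in entity_store.items() if eid == entity_id]
    rows.sort(key=lambda row: row[0])
    return [s for _, s in rows]
```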
With reference to
Reading from the entity store requires an entity identifier and optionally a specific time. Queries are direct database lookups. With reference to
After being processed in the entity store, the transaction is applied to a metadata store, in accordance with implementations of the current subject matter. The metadata store holds useful statistics about a dataset. The statistics contain information of general interest, such as the absolute minimum and maximum temporal and spatial bounds. Metadata may be stored in a single database table. Metadata for a dataset may be represented by a single record in that table, with each column representing a specific statistic.
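The bounds statistics described above could be maintained as sketched below; the statistic names and the (time, x, y) sample representation are assumptions of this example:

```python
def update_metadata(metadata, samples):
    """Fold a batch of (time, x, y) samples into a dataset's metadata
    record, widening the absolute temporal and spatial bounds."""
    for t, x, y in samples:
        metadata["min_time"] = min(metadata.get("min_time", t), t)
        metadata["max_time"] = max(metadata.get("max_time", t), t)
        metadata["min_x"] = min(metadata.get("min_x", x), x)
        metadata["max_x"] = max(metadata.get("max_x", x), x)
        metadata["min_y"] = min(metadata.get("min_y", y), y)
        metadata["max_y"] = max(metadata.get("max_y", y), y)
    return metadata
```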
With reference to
Reading metadata requires the identifier of the dataset. This identifier is used to fetch the corresponding record from the database, which is then returned to the user. With reference to
Now turning to
At 1905, a storage request from a particular device 110a,b,c,d is received by the storage and retrieval processor 120.
At 1910, the storage and retrieval processor 120 performs an initial validation on the storage request. The initial validation serves to determine if the storage request contains well-formatted data and can thus be processed into the various data stores of the storage and retrieval processor 120. If the validation determination indicates that the storage request is not valid, an error is returned to the particular device 110a,b,c,d at 1915. If the validation determination indicates that the storage request is valid, a transaction structure representing the request is stored at 1920.
At 1925, after a transaction has been added to the transaction store, an asynchronous job begins. At 1930, the asynchronous job is returned to a user (e.g., a user of the particular device 110a,b,c,d or to the particular device 110a,b,c,d itself). That is, in some implementations, an identifier representing the asynchronous job may be returned to the user. This allows the user to poll for updates as to the status of that asynchronous job.
At 1935, a detailed validation, subsequent to the initial validation at 1910, of the transaction is performed. This subsequent validation stage is performed asynchronously and may involve checking the existing data to determine whether or not the selected or identified operation types are legal, for example. If the subsequent validation indicates that the transaction is not valid, an error is returned to the user at 1940.
If the transaction is determined to be valid at 1935, then at 1945 the transaction is processed in the tile store, consistent with implementations described herein. At 1950, the transaction is processed in the entity store, followed by processing in the metadata store at 1955, both of which are described herein.
At 1960, following the asynchronous job applying the transaction to the downstream stores, a success indication is returned to the user.
Although various illustrative embodiments are described above, any of a number of changes may be made to various embodiments without departing from the scope of the invention as described by the claims. For example, the order in which various described method steps are performed may often be changed in alternative embodiments, and in other alternative embodiments one or more method steps may be skipped altogether. Optional features of various device and system embodiments may be included in some embodiments and not in others. Therefore, the foregoing description is provided primarily for exemplary purposes and should not be interpreted to limit the scope of the invention as it is set forth in the claims.
One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), computer hardware, firmware, software, and/or combinations thereof. These various aspects or features can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. The programmable system or computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
These computer programs, which can also be referred to as programs, software, software applications, applications, components, or code, include machine instructions for a programmable processor, and can be implemented in a high-level procedural language, an object-oriented programming language, a functional programming language, a logical programming language, and/or in assembly/machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus, and/or device, such as for example magnetic discs, optical disks, memory, and Programmable Logic Devices (PLDs), used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor. The machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium. The machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, such as for example as would a processor cache or other random access memory associated with one or more physical processor cores.
To provide for interaction with a user, one or more aspects or features of the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT) or a liquid crystal display (LCD) or a light emitting diode (LED) monitor for displaying information to the user and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user may provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, such as for example visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including, but not limited to, acoustic, speech, or tactile input. Other possible input devices include, but are not limited to, touch screens or other touch-sensitive devices such as single or multi-point resistive or capacitive trackpads, voice recognition hardware and software, optical scanners, optical pointers, digital image capture devices and associated interpretation software, and the like.
The examples and illustrations included herein show, by way of illustration and not of limitation, specific embodiments in which the subject matter may be practiced. As mentioned, other embodiments may be utilized and derived there from, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. Such embodiments of the inventive subject matter may be referred to herein individually or collectively by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept, if more than one is, in fact, disclosed. Thus, although specific embodiments have been illustrated and described herein, any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.