Given current advances in network technology, high-bandwidth networks that allow large amounts of data to be transmitted to a destination are becoming more pervasive. These networks even include wireless networks or wireless access networks that allow transmission bursts of a large amount of data to a destination in a short period of time.
Given a high-bandwidth network, the problem is not so much how to quickly and efficiently transmit a large amount of data to a destination. Instead, situations may arise where a device receives a large amount of data in a short period of time, but due to the size of the data, the device cannot quickly and efficiently identify particular data of interest in the received data.
For example, in an emergency situation, emergency personnel may receive a 200 GB data dump of medical records over a network for multiple injured people. If the receiving device is in the field, the device may not have the processing power or memory to quickly and efficiently identify vital information for an injured person from the 200 GB of medical records. In another example, a real estate agent representing a buyer may download housing information meeting certain criteria for the buyer. However, because the information is organized from a seller's perspective, the real estate agent may miss certain listings or be unable to quickly identify information for the buyer. Thus, in these and other situations, due to the size and possibly the lack of organization of the transmitted data, the data are less usable to the receiving device and may, in some situations, be unusable, depending on the computing resources of the receiving device.
Embodiments are illustrated by way of example and not limited in the following Figure(s), in which like numerals indicate like elements, in which:
For simplicity and illustrative purposes, the principles of the embodiments are described by referring mainly to examples thereof. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one of ordinary skill in the art that the embodiments may be practiced without limitation to these specific details. In other instances, well-known methods and structures have not been described in detail so as not to unnecessarily obscure the description of the embodiments.
According to an embodiment, a hierarchical multi-layer data package is encoded. The hierarchical multi-layer data package, also referred to as the data package, comprises a plurality of layers arranged in a hierarchy. Each layer includes one or more subpackages of data comprising summaries and meta data that allow a device to quickly identify information of interest in a layer, i.e., “skim” and determine whether to decode data in the layer or whether to “drill down” to a lower layer to identify data of interest. Thus, decoding the data package comprises evaluating summaries and meta data in subpackages in a layer and determining whether to drill down to related subpackages in lower layers or decompress information in the current layer.
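The layered structure described above can be pictured with a small data model. The following is a minimal sketch only: the class names, field names, the top-first ordering of layers, and the keyword-overlap test used for skimming are all assumptions made for illustration and are not specified by the embodiments.

```python
from dataclasses import dataclass, field


@dataclass
class Subpackage:
    keywords: list[str]                  # uncompressed identifiers for the cluster
    summary: str                         # uncompressed summary used for skimming
    archive: bytes = b""                 # compressed data held for this subpackage
    related_lower: list[int] = field(default_factory=list)  # indexes into the next layer down


@dataclass
class Layer:
    subpackages: list[Subpackage]


@dataclass
class DataPackage:
    layers: list[Layer]                  # layers[0] is taken as the top layer (assumption)


def skim(layer: Layer, topics: set[str]) -> list[int]:
    """Return indexes of subpackages whose keywords overlap the lowercase topics."""
    return [
        i
        for i, sp in enumerate(layer.subpackages)
        if topics & {k.lower() for k in sp.keywords}
    ]
```

Only the keywords are consulted while skimming, so no archive needs to be decompressed to decide which subpackages deserve a closer look.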
The devices 120 and 130 may include devices that are operable to communicate with other devices via a network or via a peer-to-peer connection. For example, the devices 120 and 130 may communicate with the server 110 via a client-server arrangement over a network, and the devices 120 and 130 may communicate with each other using a peer-to-peer protocol. Examples of the devices 120 and 130 may include a personal digital assistant, laptop, desktop, set top box, a vehicle including a computer system or substantially any device or apparatus including a computer system operable to perform the functions of the embodiments described herein. Communication between the devices 120 and 130 and the server 110 may include wired and/or wireless connections.
The information manager 140 provides information to the encoder 141 to encode data for transmission to another device and also provides information to the decoder 142 to decode received data. For example, the information manager 140 maintains a list of topics of interest for the device. It also identifies the level of detail that is desired for each of these topics, e.g., “executive briefing”, 500-word summary, white paper, all available raw data, etc. The information manager 140 also maintains current information about the state of computing resources, e.g., the processor utilization, the free memory space, etc. Using this information, the information manager 140 makes coding decisions. For example, the information manager 140 provides the encoder 141 with a compression ratio that represents the best trade-off between data package size and ease of use. One embodiment generates such advice based not only on the current computing resource measurements for the device, but also on future resource usage predictions for the device and other devices in its network.
The information manager 140 determines the hierarchical compression strategy for encoding the data. This may include the compression ratio, the maximum number of subpackages of data for a given layer in a data package and other meta data. The maximum number of subpackages may be a function of the number of statistically significant/different data clusters, as well as the status of available computing resources and the “operations goal” of the network of devices which share data packages. The “operations goal” may be based on the attributes of the computing resources for the anticipated set of devices which will transmit and/or use the data package. For example, portable devices with less memory and processing power may set goals that best utilize their computing resources. In general, more clusters will use more subpackages, thereby increasing the specificity of the data in each subpackage.
The information manager 140 also determines the maximum and target number of layers in the data package. This is a function of the overall size of the data package, as well as the status of available computing resources and the “operations goal” of the network of devices which share data packages. In general, larger data packages may use more layers, thereby reducing the amount of data that needs to be scanned at the top layer. Higher compression rates and computational efficiencies can be obtained with larger data packages. Therefore, in one embodiment, the largest data subpackages possible are used at each level in the hierarchy. This is consistent with the “burst” (transmission) and “skim” (search) approach.
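As a rough illustration of how a target number of layers might follow from the overall package size, the sketch below adds layers until the top layer fits a byte budget. The function, its parameters, and the fixed per-layer reduction factor are hypothetical; the embodiments do not prescribe a formula, and a real information manager would also weigh resource status and the operations goal.

```python
def target_layer_count(package_bytes: int, top_layer_budget: int,
                       reduction_per_layer: int, max_layers: int) -> int:
    """Count the layers needed so the top layer fits within top_layer_budget bytes.

    Assumes each added layer shrinks the data to be scanned by a fixed factor.
    """
    if reduction_per_layer < 2:
        raise ValueError("reduction_per_layer must be at least 2")
    layers, size = 1, package_bytes
    while size > top_layer_budget and layers < max_layers:
        size //= reduction_per_layer     # each layer summarizes the one below it
        layers += 1
    return layers
```

Under these assumptions a larger package yields more layers, up to the maximum, which matches the general tendency described above.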
The encoder 141 is a hierarchical encoder. Modules in the encoder 141 are shown in
The encoder 141, according to an embodiment, includes a segmentation module 201, an aggregation module 202 and a compression module 203. The segmentation module 201 applies a segmentation algorithm, which may be selected by the information manager 140, to data previously selected to be encoded. The segmentation module 201 generates clusters of data, and keywords and/or other identifiers are established for each cluster.
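The segmentation step can be sketched with a deliberately simple stand-in for a clustering algorithm: each paragraph is assigned to the topic keyword it mentions most often. The function name, the use of a supplied topic list, and the "(unassigned)" bucket are assumptions for illustration; the embodiments leave the choice of segmentation algorithm to the information manager 140.

```python
import re
from collections import defaultdict


def segment(paragraphs: list[str], topics: list[str]) -> dict[str, list[str]]:
    """Group paragraphs under the topic keyword each one mentions most often."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for para in paragraphs:
        words = re.findall(r"[a-z0-9]+", para.lower())
        counts = {t: words.count(t.lower()) for t in topics}
        # Ties resolve to the topic listed first.
        best = max(counts, key=counts.get) if counts else None
        if best is not None and counts[best] > 0:
            clusters[best].append(para)
        else:
            clusters["(unassigned)"].append(para)
    return dict(clusters)
```

The dictionary keys then serve as the keywords established for each cluster.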
The aggregation module 202 applies an aggregation algorithm, which may be selected by the information manager 140, to the clusters to generate summaries for the clusters. Summaries may be provided in XML. The layer of the data package is updated to include the summaries.
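One way to picture the aggregation step is a sentence extraction routine that emits an XML summary. This is a sketch under stated assumptions: the scoring rule (count of keyword mentions), the two-sentence default, and the element names `subpackage`, `keyword` and `summary` are hypothetical and are not taken from the embodiments.

```python
import re
import xml.etree.ElementTree as ET


def summarize_cluster(keywords: list[str], text: str, max_sentences: int = 2) -> str:
    """Build an XML summary from the sentences that mention the keywords most."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    kw = [k.lower() for k in keywords]

    def score(sentence: str) -> int:
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        return sum(words.count(k) for k in kw)

    # Rank sentences by score, keep the best ones, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])

    root = ET.Element("subpackage")      # hypothetical element names
    for k in keywords:
        ET.SubElement(root, "keyword").text = k
    ET.SubElement(root, "summary").text = " ".join(sentences[i] for i in chosen)
    return ET.tostring(root, encoding="unicode")
```

Because the summary reuses sentences verbatim from the cluster, a reader skimming the layer sees text that also appears in the underlying data.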
The compression module 203 applies a hierarchical compression strategy determined by the information manager 140. The compression module 203 may apply a compression algorithm selected by the information manager 140. Also, the compression module 203 may apply an archiving method selected by the information manager 140. The archiving method employs the compression algorithm to compress data at different layers of the data package.
One example of an archiving method is a sequential method. In the sequential method, raw source data is archived at layer 1; the subpackages at layer 1 are archived at layer 2; the subpackages at layer 2 are archived at layer 3; etc. If minimizing the data package size is important, a sequential compressed method may be applied that compresses the summaries at the current level and stores them in the archive section of the data package. Only the keywords and other meta data are provided in the data package as uncompressed. Another archiving method for minimizing the data package is the differential method. In the differential method, differences between the raw source data and summaries at layer 1 are archived at layer 1, differences between summaries at layer 1 and summaries at layer 2 are archived at layer 2, etc. The encoder 141 also records relevant compute-time statistics which can assist in the selection of summaries and real-time decoding of archives in the future.
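The contrast between the sequential and differential methods can be sketched as follows. This is an illustration only: zlib stands in for whichever compression algorithm is selected, and the line-based delta built with `difflib.SequenceMatcher` is an assumed encoding, not one named by the embodiments. The function names are hypothetical.

```python
import difflib
import json
import zlib


def sequential_archive(lower_layer_text: str) -> bytes:
    """Sequential method: archive the layer below as one compressed whole."""
    return zlib.compress(lower_layer_text.encode("utf-8"))


def differential_archive(lower_layer_text: str, summary_text: str) -> bytes:
    """Differential method: archive only what the summary does not already hold."""
    a = summary_text.splitlines()
    b = lower_layer_text.splitlines()
    ops = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(["copy", i1, i2])     # lines reused from the summary
        elif tag in ("insert", "replace"):
            ops.append(["add", b[j1:j2]])    # lines found only in the lower layer
        # "delete": summary-only lines are not needed for reconstruction
    return zlib.compress(json.dumps(ops).encode("utf-8"))


def restore_differential(archive: bytes, summary_text: str) -> str:
    """Rebuild the lower layer from its summary plus the archived differences."""
    a = summary_text.splitlines()
    out: list[str] = []
    for op in json.loads(zlib.decompress(archive).decode("utf-8")):
        if op[0] == "copy":
            out.extend(a[op[1]:op[2]])
        else:
            out.extend(op[1])
    return "\n".join(out)                    # line endings are normalized to "\n"
```

In this sketch the differential archive can only be decoded together with the uncompressed summary, whereas the sequential archive is self-contained.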
The decoder 142 is a hierarchical decoder. According to an embodiment, the decoder 142 includes an objective function module 301, a drill-down module 302 and a decompression module 303, as shown in
When decoding, if the information manager 140 determines that the subpackage 430 is relevant but more information is needed, the decoder 142 drills down to a lower layer. For example, the subpackage 420 is related to the subpackage 430 and the meta data and summary for the subpackage 430 are parsed to determine whether that subpackage contains data of interest for the user. If so, the data is decompressed. Meta data for each subpackage may identify related subpackages in higher or lower levels in the hierarchy to allow for efficiently identifying a related subpackage in another layer for drill down.
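The skim, drill-down and decompress decisions can be sketched as a short recursive routine. The class, the `summary_is_enough` callback, the keyword-overlap relevance test, and the fallback of decompressing the current subpackage when no related lower subpackage is relevant are all assumptions made for illustration; zlib stands in for the selected compression algorithm.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Subpackage:
    keywords: set[str]
    summary: str
    archive: bytes                                            # zlib-compressed data (assumed codec)
    lower: list["Subpackage"] = field(default_factory=list)   # related subpackages one layer down


def decode(sp: Subpackage, topics: set[str], summary_is_enough) -> list[str]:
    """Skim one subpackage, then stop at its summary, drill down, or decompress."""
    if not (topics & sp.keywords):
        return []                           # not relevant: skipped without decompressing
    if summary_is_enough(sp.summary):
        return [sp.summary]                 # relevant and the summary suffices
    results: list[str] = []
    for child in sp.lower:                  # relevant but more detail is needed
        results.extend(decode(child, topics, summary_is_enough))
    if results:
        return results
    # No relevant lower subpackage: decompress the data held at this level.
    return [zlib.decompress(sp.archive).decode("utf-8")]
```

Archives are decompressed only along the path that the keywords and summaries mark as relevant, which is the point of skimming before decoding.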
The data package shown in
In the data package shown in
The subpackages also include meta data. Meta data 605 and 606 are shown for the subpackages 1 and 2, respectively, and include information regarding the segmentation, aggregation and compression used. For example, segmentation includes the identification of sections of the overall data set that relate to specific themes or topic clusters. A number of algorithms may be used to perform such clustering. For a data example, segmentation can be accomplished by applying a data mining method, e.g., rule induction, classification based on association (CBA), etc. The meta data for segmentation may identify the clustering algorithm used to create the clusters.
Aggregation creates the summaries for the subpackages. In this example, the aggregation creates text summaries for the source document shown in FIGS. 5A-B6. The summaries correspond to the clusters, which may be topics of interest, identified in the segmentation. The meta data for aggregation may identify the aggregation method used to create the summaries, such as a sentence extraction method. For a data example not including text, aggregation may be accomplished by creating statistical summaries of data at a given level of stratification, i.e., including one or more segments of the data. Alternatively, data may be aggregated by generating explicit numerical relations that summarize a set of data, e.g., by using gene expression programming (GEP), such as described in U.S. Pat. No. 7,127,436, entitled “Gene Expression Programming Algorithm”, assigned to Motorola, Inc., which is incorporated by reference in its entirety. For raw data, the summary may be a best fit equation or a collection of compressed views into the data. For a time series, the summary may be a timeline trend that is sampled less frequently than the raw data, or the summary may show data only when there are significant changes.
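The two time-series summaries mentioned above can be sketched in a few lines. The function names, the (time, value) tuple representation, and the absolute-difference threshold are assumptions for illustration.

```python
def downsample(series: list[tuple[float, float]], every: int) -> list[tuple[float, float]]:
    """Trend sampled less frequently: keep every Nth (time, value) sample."""
    return series[::every]


def significant_changes(series: list[tuple[float, float]],
                        threshold: float) -> list[tuple[float, float]]:
    """Keep a sample only when its value differs from the last kept value
    by more than threshold."""
    if not series:
        return []
    kept = [series[0]]
    for t, v in series[1:]:
        if abs(v - kept[-1][1]) > threshold:
            kept.append((t, v))
    return kept
```

The first form gives a fixed reduction in size regardless of content, while the second adapts to the data and may keep very few points for a flat signal.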
The meta data may also identify the compression algorithm for compressing the document. Compression algorithms generally apply to any set of binary data. However, the information manager 140 may select a compression algorithm that is specifically tuned for good performance with certain types of data, e.g., text-only, JPEG image set, etc.
The meta data also includes an ID or a link to the compressed data. For example, if the information manager 140 determines that the subpackage includes data of interest to the user, the link, shown as <encoding param=“archive”>0</encoding>, is used to find and retrieve the compressed data from the data package.
The meta data also includes one or more keywords describing the cluster, which is the topic of interest in this example. For example, the cluster for subpackage 1 is described by the keywords “context” and “aware”.
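Reading the keywords and the archive link out of a subpackage's meta data might look like the sketch below. Only the `<encoding param="archive">0</encoding>` element and the keywords “context” and “aware” come from the description; the surrounding `subpackage` and `keyword` elements, the function names, and the interpretation of the link value as an index into a list of archives are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical meta data for subpackage 1. Element names other than
# <encoding param="archive"> are assumed for illustration.
METADATA = """
<subpackage id="1">
  <keyword>context</keyword>
  <keyword>aware</keyword>
  <encoding param="archive">0</encoding>
</subpackage>
"""


def read_metadata(xml_text: str) -> tuple[list[str], Optional[int]]:
    """Return the cluster keywords and the archive link found in the meta data."""
    root = ET.fromstring(xml_text)
    keywords = [k.text for k in root.findall("keyword") if k.text]
    link = root.find("encoding[@param='archive']")
    index = int(link.text) if link is not None and link.text else None
    return keywords, index


def fetch_archive(archives: list[bytes], xml_text: str,
                  topics: set[str]) -> Optional[bytes]:
    """Follow the link only when a keyword matches a topic of interest."""
    keywords, index = read_metadata(xml_text)
    if index is None or not (topics & set(keywords)):
        return None
    return archives[index]                  # link treated as a list index (assumption)
```

The compressed data is touched only after the uncompressed keywords indicate a match.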
The subpackages 1 and 2 include summaries 607 and 608 respectively. The summaries are created through the aggregation process. The summaries help identify whether the data for the subpackage is sufficient for the user or whether to select another subpackage or drill down to another layer. Note that the summaries include text from the source document in
The compressed data for layer 2 is shown as 609 in
Layer 3 also includes compressed data 712-714. Because the sequential archival method was used, layer 3 includes compressed data for lower-level layers 1 and 2. Other archival methods may store compressed data for the layer with the layer.
At step 801, data to be encoded is identified. For example, a set of files or some other set of data is selected for encoding. The data may be identified by a user or by other means.
At step 802, a hierarchical compression strategy is determined for encoding the data. The hierarchical compression strategy may include a target level of compression and preferred compression algorithms or archival methods based on intended recipients. For example, the compression strategy may be based on computing resource attributes for devices of intended recipients, negotiated policies and/or the number of topics or clusters.
At step 803, the selected data is divided into clusters. A segmentation algorithm may be used to generate the clusters.
At step 804, summaries are generated for the clusters, for example, using an aggregation algorithm. The summaries describe information in the clusters and may be used to identify information of interest to a user during decoding.
At step 805, the selected data associated with each cluster is compressed according to the hierarchical compression strategy. This may include implementing an archiving method, e.g., sequential, sequential compressed, differential, etc., to compress the data. Compression meta data may be generated and stored, such as compute time statistics that can be used for optimizing the decoding process in real-time.
At step 806, a layer in the data package is created including the summaries, meta data and compressed data. Examples of layers and the meta data are shown in
At step 807, a determination is made as to whether to generate another layer. For example, the information manager 140 compares meta data for each subpackage to the hierarchical compression strategy selected by the information manager 140. If one or more of the desired compression rate, summary sizes, or keyword-based specificity of summaries has been achieved, then the encoding is completed. If not, then steps 801-807 are repeated to create one or more other layers.
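The loop formed by steps 801 through 807 can be sketched with the segmentation, aggregation and stopping test passed in as functions. This is a sketch under stated assumptions: the `encode` signature, the dictionary layout of a layer, the use of zlib, the cap on the number of layers, and the choice to feed each layer's summaries in as the data for the next layer are all hypothetical.

```python
import zlib
from typing import Callable


def encode(data: str,
           segment: Callable[[str], list[str]],
           summarize: Callable[[str], str],
           goal_met: Callable[[list[str]], bool],
           max_layers: int = 5) -> list[dict]:
    """Build layers from the bottom up until the strategy's goal is met."""
    layers: list[dict] = []
    current = data                                            # step 801: data to encode
    for _ in range(max_layers):                               # cap from the strategy (step 802)
        clusters = segment(current)                           # step 803: divide into clusters
        summaries = [summarize(c) for c in clusters]          # step 804: summarize clusters
        archives = [zlib.compress(c.encode("utf-8")) for c in clusters]  # step 805
        layers.append({"summaries": summaries, "archives": archives})    # step 806
        if goal_met(summaries):                               # step 807: another layer needed?
            break
        current = "\n\n".join(summaries)                      # next layer summarizes this one
    return layers
```

Here the first entry in the returned list corresponds to layer 1, whose archives hold the raw source data, as in the sequential method.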
The system 900 includes a processor 902, providing an execution platform for executing software. Commands and data from the processor 902 are communicated over a communication bus 903. The system 900 also includes a main memory 906, such as a Random Access Memory (RAM), where software may reside during runtime, and a secondary memory 908. The secondary memory 908 may include, for example, a nonvolatile memory where a copy of software is stored. In one example, the secondary memory 908 also includes ROM (read only memory), EPROM (erasable, programmable ROM) and/or EEPROM (electrically erasable, programmable ROM).
The system 900 includes I/O devices 910. The I/O devices may include a display and/or user interfaces comprising one or more I/O devices 910, such as a keyboard, a mouse, a stylus, speaker, and the like. A communication interface 913 is provided for communicating with other components. The communication interface 913 may be a wired or a wireless interface. The communication interface 913 may be a network interface. The components of the system 900 may communicate over a bus 909.
One or more of the steps of the methods described above and other steps described herein and software described herein may be implemented as software embedded or stored on a computer readable medium. The steps may be embodied by a computer program, which may exist in a variety of forms both active and inactive. For example, they may exist as software program(s) comprising program instructions in source code, object code, executable code or other formats for performing some of the steps when executed. Modules include software, such as programs, subroutines, objects, etc. Any of the above may be stored on a computer readable medium, which includes storage devices and signals, in compressed or uncompressed form. Examples of suitable computer readable storage devices include conventional computer system RAM (random access memory), ROM (read only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM), and magnetic or optical disks or tapes. Examples of computer readable signals, whether modulated using a carrier or not, are signals that a computer system hosting or running the computer program may be configured to access, including signals downloaded through the Internet or other networks. Concrete examples of the foregoing include distribution of the programs on a CD ROM or via Internet download. In a sense, the Internet itself, as an abstract entity, is a computer readable medium. The same is true of computer networks in general. It is therefore to be understood that those functions enumerated herein may be performed by any electronic device capable of executing the above-described functions.
While the embodiments have been described with reference to examples, those skilled in the art will be able to make various modifications to the described embodiments without departing from the true spirit and scope. The terms and descriptions used herein are set forth by way of illustration only and are not meant as limitations. In particular, although the methods have been described by examples, steps of the methods may be performed in different orders than illustrated or simultaneously. Those skilled in the art will recognize that these and other variations are possible within the spirit and scope as defined in the following claims and their equivalents.
This patent application is related to U.S. patent application Ser. No. (TBD)(Attorney Docket No. CML06484BLUE), entitled “Decoding a Hierarchical Multi-Layer Data Package” by Tirpak, which is incorporated by reference in its entirety.