The present disclosure relates to computer-implemented methods, software, and systems for allowing a client to control characteristics of a response to a data query using data classification.
Data processing applications generally allow clients to store and retrieve data from data storage systems, such as databases. Such data processing applications may allow arbitrarily large sets of data to be stored and retrieved by the client, for example, to be presented to a user or analyst. The client may define the data it wishes to access in the form of a query sent to the data processing application, which may in turn respond with one or more result sets associated with the received query.
The present disclosure relates to computer-implemented methods, software, and systems for allowing a client to control characteristics of a response to a data query using data classification. One computer-implemented method includes receiving a request for a data set from a client, the request including one or more request parameters indicating one or more characteristics for a result set; identifying a set of disjoint classes associated with the data set, the identification based at least in part on the one or more request parameters, the set of classes together comprising the entire data set; associating the set of classes with a set of class representatives, each class in the set of classes being associated with a class representative from the set of class representatives; and presenting the set of class representatives to the client.
While generally described as computer-implemented software embodied on tangible media that processes and transforms the respective data, some or all of the aspects may be computer-implemented methods or further included in respective systems or other devices for performing this described functionality. The details of these and other aspects and implementations of the present disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the disclosure will be apparent from the description and drawings, and from the claims.
The present disclosure generally relates to computer-implemented methods, software, and systems for allowing a client to control characteristics of a response to a data query using data classification.
In one aspect, the present disclosure describes a solution involving a client presenting a query associated with a resource to a server storing or managing that resource. Generally, the size of a response to such a query can be as large as the size of the requested resource. This unpredictability may lead to difficulties in processing the response. For example, the client may need to parse and/or convert the response into internal data formats. In some cases, the client may desire that the transferred response be processed in a given, limited amount of time. Because the processing resources of a client are generally limited but the cardinality of the response is unbounded, the client may not be able to process the response at all, for instance because it does not fit into the client's memory, or because it would be too time consuming to do so. Further, the amount of information may be excessive for presentation on particular clients such as, for example, a smartphone or tablet.
The present disclosure describes a solution that may aggregate data on the server in response to a query from the client, thus reducing the cardinality of the response. In some implementations, the server may aggregate many overlapping data points in a scatter plot into a single representative data point that stands for all of the aggregated data points. In some instances, this aggregation may be controlled by parameters included by the client with the query. The parameters may define a fixed upper boundary to the cardinality of a response to the query. The parameters may also define characteristics that imply such an upper boundary, such as a maximum processing time for a response.
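As a non-limiting illustration, the following Python sketch shows one way a server might derive a single upper boundary from the two kinds of parameters described above. The function name, the parameter names, and the linear cost model (`ms_per_representative`) are assumptions made for this example only and are not taken from the disclosure.

```python
from typing import Optional


def resolve_max_classes(max_classes: Optional[int] = None,
                        max_processing_ms: Optional[float] = None,
                        ms_per_representative: float = 0.5) -> int:
    """Return an upper bound on the number of class representatives to send.

    Hypothetical helper: the client may give an explicit cardinality limit,
    a processing-time budget, or both; the tighter bound wins.
    """
    bounds = []
    if max_classes is not None:
        bounds.append(max_classes)
    if max_processing_ms is not None:
        # Assumption: client-side cost grows linearly with the number of
        # representatives it has to process.
        bounds.append(int(max_processing_ms // ms_per_representative))
    if not bounds:
        raise ValueError("at least one limiting parameter is required")
    # Never return fewer than one class, so a response is always possible.
    return max(1, min(bounds))
```

A real system would presumably calibrate the per-representative cost for each client type rather than use a fixed constant.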
In some implementations, the server, in response to receiving a client query, aggregates all or a substantial plurality of data entries corresponding to the query into disjoint classes. Each class may represent a potentially empty set of data entries. In some cases, each class may be of a fixed size associated with or determined based on the parameters sent from or identified as associated with the client. The data classification may also be performed by any suitable algorithm or set of algorithms that ensure that responses to the query correspond to the parameters specified by the client.
In some implementations, the server associates each class with a class representative. Each class representative may include a set of keys that uniquely identify the class and, in some instances, a cardinality value indicating the number of data entries included in the class. In some instances, the server responds to the client query with the set of class representatives for the requested data set. The client may then send the server one or more of the class representatives in a subsequent request to retrieve the data entries associated with the classes indicated by the one or more class representatives.
Referring now to
The example environment 100 may include a data classification system 133. At a high level, the data classification system 133 comprises an electronic computing device operable to receive, transmit, process, store, or manage data and information associated with the environment 100. Specifically, the data classification system 133 illustrated in
As used in the present disclosure, the term “computer” is intended to encompass any suitable processing device. For example, although
The data classification system 133 also includes an interface 136, a processor 139, and a memory 151. The interface 136 is used by the data classification system 133 for communicating with other systems in a distributed environment—including within the environment 100—connected to the network 130, for example, the client 103, as well as other systems communicably coupled to the network 130 (not illustrated). Generally, the interface 136 comprises logic encoded in software and/or hardware in a suitable combination and operable to communicate with the network 130. More specifically, the interface 136 may comprise software supporting one or more communication protocols associated with communications such that the network 130 or interface's hardware is operable to communicate physical signals within and outside of the illustrated environment 100.
As illustrated in
The illustrated data classification system 133 also includes a data request server 150. In some implementations, the data request server 150 may receive queries from the one or more clients 103, perform data classification operations on the data set described by the query, and return class representatives to the client associated with the one or more classes identified by the data classification operations. The data request server 150 may be a single software program running on a single server, or it may be multiple software or hardware components distributed across one or more servers.
In the depicted implementation, the data request server 150 includes a query processing engine 152, a data classification engine 154, and a data retrieval engine 156. In some instances, the query processing engine 152 may be operable to process a query received from the one or more clients 103. This processing may include parsing the query to extract one or more request parameters specified by the client identifying one or more characteristics of a response. For example, a query received from the one or more clients 103 may include a structured query language (SQL) query specifying the data set the client wishes to retrieve, where the query includes or is associated with (e.g., is carried within the same request as) a request parameter indicating the maximum cardinality of the set of classes into which to divide the data set. In this way, the client may specify the maximum size of the set of classes into which the data set will be divided. In another example, the request parameter may indicate the maximum cardinality of a set of data entries to be returned to the client in a response, thus allowing the client to specify the maximum number of data entries it will receive in response to a request.
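By way of example only, the parsing step described above might be sketched in Python as follows for a request whose SQL query and request parameter travel in a URL query string. The parameter names `q` and `max_classes` and the example URL are hypothetical; the disclosure does not prescribe a wire format.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlsplit


def parse_request(url: str) -> Tuple[str, Optional[int]]:
    """Extract the SQL query and the optional class-count limit from a URL.

    Hypothetical convention: `q` carries the SQL text and `max_classes`
    carries the maximum cardinality of the set of classes.
    """
    params = parse_qs(urlsplit(url).query)
    if "q" not in params:
        raise ValueError("request does not contain a query")
    query = params["q"][0]
    max_classes = None
    if "max_classes" in params:
        max_classes = int(params["max_classes"][0])
        if max_classes < 1:
            raise ValueError("max_classes must be at least 1")
    return query, max_classes


query, limit = parse_request(
    "http://example.test/data?q=SELECT+x%2C+y+FROM+points&max_classes=16"
)
```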
In some implementations, the SQL query may be sent between the client 103 and the query processing engine 152 of the data classification system 133 using the Hypertext Transfer Protocol (HTTP). The SQL query may also be sent using any suitable protocol or mechanism, including, but not limited to, Simple Object Access Protocol (SOAP), JavaScript Object Notation (JSON), Remote Procedure Call (RPC), Extensible Markup Language (XML), Open Data Protocol (OData), or any other suitable protocol or mechanism or combination thereof.
In some instances, the query processing engine 152 may analyze the received query and perform one or more translations to alter the query for use by the data classification system 133. In some implementations, the query processing engine 152 may perform an analysis to determine the cardinality or size of the data set specified by the received query. This information may be used during data classification. In some cases, the query processing engine 152 may be a separate software or hardware process running within the data classification system 133. The query processing engine 152 may also be a component or module integrated within the data request server 150. In some instances, the query processing engine 152 may be a system external to the data classification system 133.
In some cases, the data classification engine 154 may be located on a separate system external to the other components of the data classification system 133. Further, the present disclosure contemplates the various components of the data classification system 133 each being located on a separate system, all components being located on the same system, or some components being co-located on the same system and others being located on separate systems.
Data request server 150 may also include a data classification engine 154. In some implementations, data classification engine 154 may partition the data set identified by the query processing engine 152 into one or more classes. The one or more classes may be disjoint classes such that no data entry in the data set is included in more than one of the one or more classes. Further, the one or more classes may completely cover the data set, such that each data entry in the data set is included in at least one of the one or more classes.
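The two properties described above (disjoint classes and complete coverage) can be checked mechanically. The following Python sketch is illustrative only; it assumes each data entry is identified by a hashable identifier, and the function name is hypothetical.

```python
from typing import Hashable, Iterable


def is_partition(entry_ids: Iterable[Hashable],
                 classes: Iterable[Iterable[Hashable]]) -> bool:
    """Return True if `classes` is a disjoint, complete cover of `entry_ids`."""
    seen = set()
    for members in classes:
        for entry_id in members:
            if entry_id in seen:
                # The entry already belongs to another class: not disjoint.
                return False
            seen.add(entry_id)
    # Complete coverage: every entry was placed, and nothing extra appeared.
    return seen == set(entry_ids)
```

Note that an empty class does not violate either property, which is consistent with classes being potentially empty sets of data entries.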
In some cases, the data classification engine 154 may execute an algorithm to perform the classification of the data set. In some instances, the algorithm may be operable to divide the data set into multiple equal size classes, each class corresponding to an equal size range of values. For example, one example algorithm might divide the data set into classes based only on the value of an attribute X. In such an example, a first class might include data entries having a value of X between 0 and 1, and a second class might include data entries having a value of X between 1 and 2. This algorithm is presented for exemplary purposes only, and the present disclosure contemplates the use of any suitable algorithm to perform this data classification, including, but not limited to, multidimensional classification algorithms classifying based on two or more attributes, classification algorithms that examine the data set to determine optimal classification schemes, classification algorithms specifically provided by the client, or any other suitable algorithm or combination of algorithms. In some implementations, the data classification engine 154 may perform classification on the data set in response to receiving a request for the data set. The data classification engine 154 may also perform classification of the data set prior to receiving a request, such as when the data is created or inserted. The data classification engine 154 may also perform classification each time data is requested, or may perform the classification once and use the pre-classified data to respond to requests.
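For illustration, the single-attribute example above might be sketched in Python as follows. The sketch assumes that each data entry is a dictionary with a numeric attribute `x` and that each class covers a half-open range, so that a value of exactly 1 falls into the second class; neither assumption is mandated by the disclosure.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List


def classify_by_x(entries: Iterable[dict], width: float = 1.0) -> Dict[int, List[dict]]:
    """Group entries into equal-width classes on attribute `x`.

    Class `i` holds entries with i * width <= x < (i + 1) * width.
    """
    classes = defaultdict(list)
    for entry in entries:
        index = math.floor(entry["x"] / width)
        classes[index].append(entry)
    return dict(classes)


example = classify_by_x([{"x": 0.2}, {"x": 0.9}, {"x": 1.0}, {"x": 1.5}])
```

In this example, class 0 receives the entries with `x` of 0.2 and 0.9, and class 1 receives the entries with `x` of 1.0 and 1.5.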
The data classification engine 154 may also produce a set of class representatives corresponding to the set of classes produced during classification. In some implementations, each class representative corresponds to one class from the set of classes, and includes a set of keys uniquely identifying the associated class. Each class representative may also include other information about the associated class, such as, for example, a cardinality indicating the number of data entries included in the associated class.
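One possible in-memory shape for a class representative, offered as a sketch only, is shown below in Python. The type name, field names, and the choice of a tuple of integers as the key set are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ClassRepresentative:
    keys: Tuple[int, ...]  # uniquely identifies the associated class
    cardinality: int       # number of data entries in the associated class


def build_representatives(
    classes: Dict[Tuple[int, ...], List[dict]]
) -> List[ClassRepresentative]:
    """Produce one representative per class, ordered by class key."""
    return [
        ClassRepresentative(keys=key, cardinality=len(members))
        for key, members in sorted(classes.items())
    ]
```

Because a representative carries only keys and a count, the size of the response depends on the number of classes rather than on the number of data entries.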
In some instances, the data classification engine 154 may return the set of class representatives to the client 103 that sent the original request. In some implementations, the response may be sent to the client using the same protocol or mechanism used to communicate the original query between the client 103 and the query processing engine 152. The response may also be communicated by any other suitable protocol or mechanism.
In some implementations, classes may be defined using an algorithm that takes into account longitude and latitude for location-based data and a parameter that determines granularity. Each geo-located data set can be represented by a class, and can be further broken down by other criteria, such as time. In such a case, the number of classes is finite and bounded by the granularity. For instance, incident data (related to car accidents, fires, etc.) could be classified in this way. Average income, taxes, and other financial data could also be constrained to specific countries. In some cases, data could be classified by state (such as in the United States) or by federal state (in the European Union), by county, or by any other classification that denotes a well-defined grid.
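As one hedged sketch of such a location-based scheme, the Python functions below map a latitude and longitude to a grid cell whose size is set by a granularity parameter expressed in degrees, and compute the resulting finite bound on the number of classes. The function names and the choice of a regular degree grid are assumptions for this example; the disclosure equally contemplates administrative regions such as states or counties.

```python
import math
from typing import Tuple


def geo_class_key(lat: float, lon: float, granularity_deg: float) -> Tuple[int, int]:
    """Map a location to the (row, column) key of its grid cell."""
    rows = math.ceil(180.0 / granularity_deg)
    # Clamp so that the north pole (lat == 90) falls into the last row.
    row = min(math.floor((lat + 90.0) / granularity_deg), rows - 1)
    # Wrap longitude so that 180 and -180 share the same column.
    col = math.floor(((lon + 180.0) % 360.0) / granularity_deg)
    return row, col


def max_geo_classes(granularity_deg: float) -> int:
    """Upper bound on the number of classes for a given granularity."""
    return math.ceil(180.0 / granularity_deg) * math.ceil(360.0 / granularity_deg)
```

With a granularity of 10 degrees, for example, the grid contains at most 648 classes regardless of how many data entries are stored.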
The data request server 150 may further include a data retrieval engine 156. In some implementations, the data retrieval engine 156 may receive a request from the client 103 including one or more class representatives. The one or more class representatives included in the request may be a subset of the class representatives sent to the client by the data classification engine 154 in response to the original query. In response to receiving the request, the data retrieval engine 156 may process the one or more class representatives to determine the classes with which they are associated. Once the data retrieval engine 156 determines the associated classes, the data retrieval engine 156 may retrieve the data entries associated with these classes, such as from the database 168. The data retrieval engine 156 may then send these associated data entries to the client 103 in a response. In some implementations, each of the associated data entries may be sent to the client in a separate message. One or more associated data entries may also be sent in a single message to the client. In some cases, the data retrieval engine 156 may return a reference or pointer to the client that the client may use to retrieve the one or more associated data entries. For example, the data retrieval engine 156 may return a result set object to the client 103. Using this result set object, the client 103 may retrieve the data entries one at a time, may retrieve the entire data set at once, or may retrieve the data in any other desired fashion.
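The retrieval step might be sketched as follows using Python's standard `sqlite3` module, whose cursor plays the role of the result set object: entries can be fetched one at a time with `fetchone()` or all at once with `fetchall()`. The table `points(id, x)`, the use of an integer class index as the representative, and the equal-width class boundaries are assumptions for this example only.

```python
import sqlite3
from typing import Sequence


def open_result_set(conn: sqlite3.Connection,
                    representatives: Sequence[int],
                    width: float = 1.0) -> sqlite3.Cursor:
    """Return a cursor over the entries of the classes named by the client.

    Assumption: representative `i` denotes the class i*width <= x < (i+1)*width
    over a hypothetical table `points(id, x)`.
    """
    if not representatives:
        raise ValueError("at least one class representative is required")
    # One range predicate per requested class; values are bound as parameters.
    clauses = " OR ".join("(x >= ? AND x < ?)" for _ in representatives)
    bounds = []
    for index in representatives:
        bounds.extend((index * width, (index + 1) * width))
    return conn.execute(
        f"SELECT id, x FROM points WHERE {clauses} ORDER BY id", bounds
    )
```

Returning the cursor rather than a materialized list leaves the choice of retrieval pattern to the caller, which mirrors the result set object described above.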
Regardless of the particular implementation, “software” may include computer-readable instructions, firmware, wired and/or programmed hardware, or any combination thereof on a tangible medium (transitory or non-transitory, as appropriate) operable when executed to perform at least the processes and operations described herein. Indeed, each software component may be fully or partially written or described in any appropriate computer language including C, C++, Java™, Visual Basic, assembler, Perl®, any suitable version of 4GL, as well as others. While portions of the software illustrated in
The data classification system 133 also includes a memory 151, or multiple memories 151. The memory 151 may include any type of memory or database module and may take the form of volatile and/or non-volatile memory including, without limitation, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), removable media, or any other suitable local or remote memory component. The memory 151 may store various objects or data, including caches, classes, frameworks, applications, backup data, jobs, web pages, web page templates, database tables, repositories storing static and/or dynamic information, and any other appropriate information including any parameters, variables, algorithms, instructions, rules, constraints, or references thereto associated with the purposes of the data classification system 133. Additionally, the memory 151 may include any other appropriate data, such as VPN applications, firmware logs and policies, firewall policies, a security or access log, print or other reporting files, as well as others.
As illustrated in
Database 168 may include different data items related to classifying data to limit the size of response to a client. The illustrated database 168 includes a data set 170, one or more classes 174 associated with the data set 170, and one or more class representatives 176 associated with the one or more classes 174. In other implementations, the database 168 may contain any additional information necessary to support the particular implementation.
In the illustrated implementation, the database 168 includes a data set 170. In some implementations, the data set 170 is identified by the query processing engine 152 in response to receiving a query from the client 103. The data set 170 may correspond to the data responsive to the received query. In some implementations, the data set 170 is a subset of a larger data set included in database 168. The data set 170 may include one or more data entries representing individual records within the data set. In some implementations, the one or more data entries include rows within one or more tables within the database 168. The one or more data entries may also correspond to other structures within the database 168, such as objects, stored procedures, triggers, or any other data or metadata construct.
The database 168 may also include a set of classes 174. In some implementations, the set of classes 174 is produced by the data classification engine 154 to represent the results of its data classification operation. As described previously, the set of classes 174 may be a disjoint set such that none of the classes overlap, and may offer complete coverage of the data set 170 such that every data entry in the data set 170 is covered by exactly one class from the set of classes 174. In some implementations, the classes 174 may be stored as additional rows in tables in the database 168 specifying the boundaries of each class in the set of classes. The classes 174 may also be stored as temporary tables created by the data classification engine 154 during data classification. In some implementations, the temporary tables corresponding to the classes 174 may exist while the query associated with the classes is still valid or in scope, and may be deleted or otherwise purged when the query is no longer valid.
In some implementations, the database 168 may also include a set of class representatives 176. The class representatives 176 may be produced by the data classification engine 154 during data classification. As described previously, the class representatives 176 may each correspond to one of the classes 174, and may be returned to the client in response to the initial query. The client may subsequently retrieve the data entries associated with one of the classes 174 by specifying the corresponding class representative from the set of class representatives 176. In some implementations, the class representatives 176 are stored in the database 168 in a similar manner to the classes 174. The class representatives 176 may also be stored in a different manner than the classes 174.
The illustrated environment of
There may be any number of clients 103 associated with, or external to, the environment 100. For example, while the illustrated environment 100 includes one client 103, alternative implementations of the environment 100 may include multiple clients 103 communicably coupled to the data classification system 133 and/or the network 130, or any other number suitable to the purposes of the environment 100. Additionally, there may also be one or more additional clients 103 external to the illustrated portion of environment 100 that are capable of interacting with the environment 100 via the network 130. Further, the terms “client” and “user” may be used interchangeably as appropriate without departing from the scope of this disclosure. Moreover, while the client 103 is described in terms of being used by a single user, this disclosure contemplates that many users may use one computer, or that one user may use multiple computers.
The illustrated client 103 is intended to encompass any computing device such as a desktop computer, laptop/notebook computer, wireless data port, smart phone, personal data assistant (PDA), tablet computing device, one or more processors within these devices, or any other suitable processing device. For example, the client 103 may comprise a computer that includes an input device, such as a keypad, touch screen, or other device that can accept user information, and an output device that conveys information associated with the operation of the data classification system 133 or the client 103 itself, including digital data, visual information, or a graphical user interface (GUI). In some implementations, the client 103 may be an automated system using the data from the data classification system 133 for additional analysis with or without human interaction.
Circle 210 represents the full data set stored by the server. The query data set 214 associated with the query 212 sent by the client to the server may be a subset of the full data set 210. As described previously, the query data set 214 may be identified by executing the query 212 against the full data set 210. The server may receive the query 212 from the client, as described relative to example
The query data set 214 is then classified (arrow 216) to produce a set of class representatives 218. As discussed previously, the class representatives 218 are associated with a disjoint set of classes covering the entire query data set 214. The set of class representatives 218 may then be sent to the client in a response (arrow 220), providing the client with a set of class representatives 222. The client may then send a select request (arrow 224) to the server, where the select request (arrow 224) includes one or more of the class representatives 222. The server may then execute the select request (arrow 224) against the query data set 214. As discussed with regard to the examples of
The illustrated implementation includes a number of data points 306. Each data point 306 may represent one data entry in the data set, with the data point's position on the diagram relative to the x- and y-axes indicating its associated key values. In a multidimensional mapping, a data point's position relative to the multiple axes may represent values for the different keys associated with the axes. For example, in a three-dimensional mapping each of the data points 306 would appear as a point in the three-dimensional cube formed by the three axes of the graph.
The data graph 300 may include one or more partitions, such as those identified by reference numbers 308, 310, 312. In some implementations, the partitions represent the different classes into which the data set represented by the graph 300 has been divided. In the illustrated implementation, the classification scheme includes a grid with fixed size partitions. Each partition includes data entries with values of X and Y falling within a certain range. For example, partition 308 may include data entries with values of Y between 2 and 3 and values of X between 0 and 1, whereas partition 310 may include data entries with values of X between 0 and 1 and values of Y between 0 and 1.
In some implementations, the graph 300 shows the result of the classification operation such as that described with respect to
The illustrated graph 300 also includes an empty partition 312. In some implementations, the data classification operation may lead to classes including no data entries. Such classes may be represented by an empty set having a cardinality of 0.
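A fixed-size grid of the kind described above, including its empty partitions, might be tallied as in the following Python sketch. The function name, the tuple representation of data points, and the half-open cell boundaries are assumptions for this example.

```python
import math
from itertools import product
from typing import Dict, Iterable, Tuple


def grid_cardinalities(points: Iterable[Tuple[float, float]],
                       x_cells: int,
                       y_cells: int,
                       cell_size: float = 1.0) -> Dict[Tuple[int, int], int]:
    """Count the data points in every cell of a fixed grid.

    Every cell starts at 0, so empty partitions remain present in the
    result with a cardinality of 0.
    """
    counts = {cell: 0 for cell in product(range(x_cells), range(y_cells))}
    for x, y in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        if cell not in counts:
            raise ValueError(f"point {(x, y)} lies outside the grid")
        counts[cell] += 1
    return counts
```

An implementation could also omit empty classes from the response to save space; whether the client benefits from seeing them depends on the presentation.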
Referring now to
At 502, a request for a data set is received from a client, the request including request parameters indicating characteristics for a result set. In some implementations, the request may be received from the client over a network, as described relative to
At 504, a set of classes associated with the data set is identified based at least in part on the request parameters. As discussed relative to
At 506, the set of classes is associated with a set of class representatives, each class in the set of classes being associated with a class representative from the set of class representatives. In some implementations, each class representative may include a set of keys uniquely identifying the associated class, as well as other attributes associated with the class, such as, for example, a cardinality value. At 508, the set of class representatives is presented to the client. In some implementations, this may include sending the client one or more messages via a network including the class representatives.
Referring now to
At 702, a request is received from the client including one or more class representatives from the set of class representatives. In some implementations, the one or more class representatives include the class representatives presented to the client at 508 of method 500. At 704, one or more classes are identified from the set of classes associated with the one or more received class representatives. In some instances, this identification includes applying the inverse of a function used to generate the class representative to identify the associated class. The identification may also include performing a lookup into a database mapping the class representatives to their associated classes. At 706, a result set is presented to the client including a portion of the data set associated with the identified one or more classes.
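The inverse-function approach to identifying a class from its representative might look like the following Python sketch. The string key format `"ix:iy"`, the function names, and the fixed-size grid are hypothetical choices for this example; a lookup table mapping representatives to classes would serve the same purpose.

```python
from typing import Tuple

Range = Tuple[float, float]


def make_key(ix: int, iy: int) -> str:
    """Generate a representative key from a class's grid indices."""
    return f"{ix}:{iy}"


def class_bounds_from_key(key: str, cell_size: float = 1.0) -> Tuple[Range, Range]:
    """Invert `make_key`: recover the X and Y ranges of the associated class."""
    ix_text, iy_text = key.split(":")
    ix, iy = int(ix_text), int(iy_text)
    x_range = (ix * cell_size, (ix + 1) * cell_size)
    y_range = (iy * cell_size, (iy + 1) * cell_size)
    return x_range, y_range
```

Because the key alone determines the class boundaries, this variant requires no server-side state between the two requests, whereas a database lookup allows arbitrary, irregular classes.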
The preceding figures and accompanying descriptions illustrate example processes and computer implementable techniques. But system 100 (or its software or other components) contemplates using, implementing, or executing any suitable technique for performing these and other tasks. It will be understood that these processes are for illustration purposes only and that the described or similar techniques may be performed at any appropriate time, including concurrently, individually, or in combination. In addition, many of the steps in these processes may take place simultaneously, concurrently, and/or in different orders than as shown. Moreover, system 100 may use processes with additional steps, fewer steps, and/or different steps, so long as the methods remain appropriate.
In other words, although this disclosure has been described in terms of certain implementations and generally associated methods, alterations and permutations of these implementations and methods will be apparent to those skilled in the art. Accordingly, the above description of example implementations does not define or constrain this disclosure. Other changes, substitutions, and alterations are also possible without departing from the spirit and scope of this disclosure.