This invention relates generally to a multi-tenant SaaS environment and, more specifically, to a system and method for extracting large customer data volumes at high speed from an external multi-tenant SaaS environment.
The present invention relates to extracting customer data from an external multi-tenant SaaS environment. The data may be extracted for backup or to provide availability outside the SaaS environment (e.g., an ultra-high availability cloud emulator). Examples of SaaS providers include SALESFORCE, ORACLE, and MICROSOFT. SaaS customers do not own or control the infrastructure on which the software is run and on which their data is stored. They have access only to the service.
The problem is that a large amount of data has to pass through the SaaS provider's APIs, which can be likened to a narrow “tube” for the data. In addition, the SaaS provider places limits on API use. For example, the limits may be based on the type of package the customer has purchased from the SaaS provider. The customer also needs to use its allocated API bandwidth for accessing the SaaS software application (i.e., reading from and writing to the SaaS environment). Therefore, any backup must comply with the SaaS provider's API limits and also not interfere with the customer's ability to use the SaaS application. The backup also has time constraints. For example, if backups are to be made daily, then a backup cannot take more than one day. Also, the more time it takes to back up data, the more inconsistencies there will be between the backup and the customer's actual data, because the customer may be making changes to the data while the backup is taking place.
While the SaaS provider has a number of APIs, there is a need for a system and method for determining which data to extract using which API so that the data can be extracted efficiently without violating the SaaS provider and customer API constraints.
The present invention provides a system and method for extracting large customer data volumes at high speed from an external multi-tenant SaaS environment according to an extraction plan that specifies the extraction groups for the data objects as well as the API, extraction frequency, extraction mode, extraction parameters, and degree of parallelism for each extraction group in order to optimize the efficiency of extracting the data objects without violating the SaaS provider and customer API constraints. In certain embodiments, the method includes classifying each data object with a shape, where the shape of the data object is a function of the volume, width, field types, and field weights of the data object, and the API is assigned to each extraction group based at least in part on the shape of the data objects in the extraction group. In certain embodiments, the method further includes identifying chunks of data in a data object for purposes of data extraction, and, after extracting the data object, reconciling the chunks of data in the data object. In certain embodiments, an API is assigned to an extraction group based on a specified API, a preferred API, or a default API. In certain embodiments, an API is assigned to an extraction group using a neural network to predict the preferred API assignment for the extraction group based on one or more characteristics of the data objects in the extraction group and the scope of data extraction for the extraction group.
In one embodiment, a method for extracting large customer data volumes at high speed from an external multi-tenant SaaS environment comprises the following steps:
The present disclosure describes a system, method, and computer program for extracting large customer data volumes at high speed from an external multi-tenant SaaS environment. The method is performed by a computer system that includes servers, storage systems, networks, operating systems, and databases (“the system”).
Example implementations of the methods are described in more detail with respect to
The system creates a plurality of extraction groups, where each extraction group includes one or more data objects (step 220). The system determines an extraction frequency and an extraction mode for each of the extraction groups (step 230). In one embodiment, the system may determine extraction frequency based on the type of data being extracted. Examples of extraction modes in a backup system are full backups and incremental backups. The system determines a scope of data extraction for each of the extraction groups (step 240). Determining the scope of data extraction may include determining what types of data will be extracted for the data objects in the extraction group. In certain embodiments, determining the scope of data extraction for an extraction group includes determining whether files will be extracted from data objects in the group. For example, this may include determining whether the system will extract all non-file data, extract both data and files, extract files only, include deleted records from the data objects, etc.
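By way of non-limiting illustration, the information determined in steps 220 through 240 could be held in a record such as the one sketched below. The class names, field names, and the use of hours as the unit of extraction frequency are assumptions made for this example and do not appear in the disclosure.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ExtractionMode(Enum):
    FULL = "full"                # extract every record in scope
    INCREMENTAL = "incremental"  # extract only records changed since the last run


@dataclass(frozen=True)
class ExtractionScope:
    include_data: bool = True      # non-file record data
    include_files: bool = False    # files attached to data objects
    include_deleted: bool = False  # deleted records


@dataclass
class ExtractionGroup:
    name: str
    data_objects: List[str]  # step 220: one or more data objects
    frequency_hours: int     # step 230: extraction frequency (assumed unit: hours)
    mode: ExtractionMode     # step 230: extraction mode
    scope: ExtractionScope = field(default_factory=ExtractionScope)  # step 240

    def __post_init__(self) -> None:
        if not self.data_objects:
            raise ValueError("an extraction group needs at least one data object")


# Hypothetical group that extracts files only, once a day, incrementally.
files_group = ExtractionGroup(
    name="files",
    data_objects=["Attachment"],
    frequency_hours=24,
    mode=ExtractionMode.INCREMENTAL,
    scope=ExtractionScope(include_data=False, include_files=True),
)
```

Keeping the scope as a separate immutable value lets several groups share the same scope definition while differing in frequency or mode.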
The system assigns an API to each extraction group based on one or more characteristics of the data objects in the extraction group, the scope of data extraction for the extraction group, and the customer's available bandwidth for each of the APIs to the multi-tenant SaaS environment (step 250). Examples of APIs include SOAP, REST, Bulk, Bulk+, Bulkv2, etc. In certain embodiments, multiple extraction groups are associated with the same API.
The system identifies extraction parameters for each of the extraction groups (step 260). In certain embodiments, identifying extraction parameters includes identifying chunks of data in a data object for purposes of data extraction, and after the extraction step, the method further includes reconciling the chunks of data in the data object. In one embodiment, this involves determining whether to extract one or more data objects in an extraction group using chunks of data, for example, by determining whether to split or divide data objects by rows. This may be especially useful for “tall” data objects having many rows (e.g., more than 1,000,000). The system then determines a degree of parallelism among the extraction groups (step 270). In addition to parallelism taking place among the extraction groups, it can also take place within an extraction group.
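The following sketch illustrates one possible way of dividing a tall data object by rows and reconciling the extracted chunks afterwards. The disclosure does not define how reconciliation is carried out; the approach shown here (merging on a record key, discarding duplicate keys, and checking the final row count) is an assumption, as are the function names and the `id` key.

```python
from typing import Dict, Iterable, List, Tuple


def plan_row_chunks(total_rows: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split rows [0, total_rows) into half-open (start, end) ranges."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_rows))
        for start in range(0, total_rows, chunk_size)
    ]


def reconcile_chunks(
    chunks: Iterable[List[Dict]], expected_rows: int, key: str = "id"
) -> List[Dict]:
    """Merge extracted chunks on a record key and verify the row count."""
    merged: Dict[object, Dict] = {}
    for chunk in chunks:
        for row in chunk:
            merged[row[key]] = row  # a later copy of the same key replaces an earlier one
    if len(merged) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(merged)}")
    return [merged[k] for k in sorted(merged)]


# A data object with 2,500,000 rows split into chunks of at most 1,000,000 rows.
ranges = plan_row_chunks(2_500_000, 1_000_000)
```

Because the chunks may be extracted in parallel and may arrive out of order or overlap, the reconciliation step in this sketch does not rely on arrival order.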
The system creates an extraction plan that specifies the extraction groups as well as the API, extraction frequency, extraction mode, extraction parameters, and degree of parallelism for each extraction group (step 280). In one embodiment, the system defaults to extracting up to a threshold number of groups (e.g., 10) in parallel, depending on the amount of API bandwidth the customer has purchased from the SaaS provider and the customer usage of the APIs.
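As a minimal sketch of how such a plan might be represented and executed, the example below stores the per-group settings in a plan object and runs the groups with a bounded thread pool. The class names, the `chunk_rows` parameter, and the `run_plan` function are hypothetical; the cap of 10 parallel groups follows the example threshold given above, and a real system would further adjust it to the customer's purchased bandwidth and API usage.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PlannedGroup:
    name: str
    api: str              # API assigned to the group
    frequency_hours: int  # extraction frequency (assumed unit: hours)
    mode: str             # e.g. "full" or "incremental"
    chunk_rows: int       # extraction parameter; 0 means the group is not chunked


@dataclass(frozen=True)
class ExtractionPlan:
    groups: List[PlannedGroup]
    max_parallel_groups: int = 10  # example threshold from the description


def run_plan(plan: ExtractionPlan, extract: Callable[[PlannedGroup], str]) -> List[str]:
    """Run `extract` for every group, never exceeding the plan's parallelism cap."""
    workers = max(1, min(plan.max_parallel_groups, len(plan.groups)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of plan.groups.
        return list(pool.map(extract, plan.groups))


plan = ExtractionPlan(
    groups=[
        PlannedGroup("metadata", "REST", 24, "full", 0),
        PlannedGroup("accounts", "Bulk", 24, "incremental", 250_000),
    ],
    max_parallel_groups=2,
)
```

Here `extract` stands in for whatever routine performs the API calls for one group; it is passed in as a parameter so that the sketch makes no assumption about a particular SaaS client library.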
In certain embodiments, the method includes classifying each data object with a shape, where the shape of a data object is a function of the volume, width, field types, and field weights of the data object. In certain embodiments, the API assigned to each extraction group is based at least in part on the shape of the data objects in the extraction group. In certain embodiments, creating the extraction groups includes grouping data objects by shape and creating an extraction group for each shape grouping.
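The disclosure does not specify the function that maps volume, width, field types, and field weights to a shape. Purely as an illustration, the sketch below uses simple thresholds and then groups data objects by the resulting shape. The shape labels, the threshold values for width and field weight, and all names are assumptions; only the 1,000,000-row figure for “tall” objects is taken from the description.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ObjectStats:
    row_count: int                   # volume
    field_types: Dict[str, str]      # field name -> type, e.g. "text" or "file"
    field_weights: Dict[str, float]  # field name -> average bytes per value (assumed meaning)

    @property
    def width(self) -> int:
        return len(self.field_types)


def classify_shape(
    stats: ObjectStats,
    tall_rows: int = 1_000_000,
    wide_fields: int = 200,        # hypothetical threshold
    heavy_bytes: float = 100_000,  # hypothetical threshold
) -> str:
    if "file" in stats.field_types.values():
        return "file-bearing"
    if max(stats.field_weights.values(), default=0.0) >= heavy_bytes:
        return "heavy"
    if stats.row_count > tall_rows:
        return "tall"
    if stats.width > wide_fields:
        return "wide"
    return "regular"


def group_by_shape(objects: Dict[str, ObjectStats]) -> Dict[str, List[str]]:
    """Return one list of data object names per shape."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, stats in objects.items():
        groups[classify_shape(stats)].append(name)
    return dict(groups)
```

Each key of the returned mapping would then become one extraction group, so that objects with similar extraction behavior share an API and extraction parameters.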
In certain embodiments, the method includes, for each extraction group, determining whether any data fields of data objects in the extraction group should be excluded from the extraction group, and, in response to determining to exclude a data field from an extraction group, excluding the data field. For example, the system may exclude all file fields from the backup group. SaaS providers often mandate that files must be extracted with the slowest available API (e.g., REST). The system may exclude files from a backup group so that a faster API may be used for the other data in the group. It may create a backup group just for files. In certain embodiments, the method further includes excluding a data field from a first extraction group that is also included in a second extraction group. In certain embodiments, the method includes excluding a data field from an extraction group in response to the data field being incompatible with the API assigned to the extraction group.
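A minimal sketch of this exclusion step is given below. It assumes a compatibility table that lists, for each API, the field types that API cannot extract; the table contents, the API names used as keys, and the function names are hypothetical and serve only to show how excluded fields could be routed to a separate group.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical compatibility table: field types each API cannot extract.
UNSUPPORTED_TYPES: Dict[str, Set[str]] = {
    "Bulk": {"file"},
    "REST": set(),
}


def partition_fields(fields: Dict[str, str], api: str) -> Tuple[List[str], List[str]]:
    """Return (kept, excluded) field names for one data object and one API."""
    blocked = UNSUPPORTED_TYPES.get(api, set())
    kept = [name for name, ftype in fields.items() if ftype not in blocked]
    excluded = [name for name, ftype in fields.items() if ftype in blocked]
    return kept, excluded


def split_out_excluded(
    objects: Dict[str, Dict[str, str]], api: str
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split a group into the fields kept for `api` and the fields moved elsewhere."""
    main: Dict[str, List[str]] = {}
    moved: Dict[str, List[str]] = {}
    for obj, fields in objects.items():
        kept, excluded = partition_fields(fields, api)
        main[obj] = kept
        if excluded:
            moved[obj] = excluded  # candidates for a separate, file-only group
    return main, moved


group = {
    "Contact": {"Name": "text", "Photo": "file", "Score": "number"},
    "Task": {"Subject": "text"},
}
main_fields, file_fields = split_out_excluded(group, "Bulk")
```

In this example the `Photo` field is removed from the group assigned to the faster API and is left for a separate group that uses an API able to extract files.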
In certain embodiments, creating the plurality of extraction groups includes grouping data objects by object type and creating an extraction group for each object type. For example, creating an extraction group for each of the following: metadata, standard data objects in the SaaS environment, customer data objects, deleted data, archived data, etc. In certain embodiments, creating the extraction groups further includes creating an extraction group for a specific data type. For example, creating an extraction group for files.
In response to determining that the data type in the scope of extraction does not require a certain API, the system identifies a preferred API for the extraction group based on one or more characteristics of the extraction group (step 330). In one embodiment, the preferred API is based on the total number of rows being extracted for the extraction group. For extraction groups with larger numbers of rows, the preferred API is an API that is optimized for large data sets. If the extraction group is relatively small, then the preferred API may be a slower API with fewer bandwidth constraints. For example, in one embodiment, the preferred APIs are as follows: 1) fewer than 50,000 rows uses the REST API, which is an API that conforms to the constraints of the REST architectural style and allows for interaction with RESTful web services, 2) between 50,000 and 1,000,000 rows uses the Bulk API, which conforms to REST principles but is optimized for large data sets, and 3) more than 1,000,000 rows uses Bulk PK Chunking. The system then determines whether the preferred API is available to use for the extraction group (step 340). In one embodiment, this depends on the plan the customer has purchased from the SaaS provider (i.e., the amount of bandwidth available to the customer for a certain API may depend on the plan they purchased) and the customer usage of the preferred API for other purposes. A customer may specify that no more than a certain percentage of bandwidth of the preferred API can be used for data extraction. This is because the customer may be primarily using the API to access the cloud-based software services. In response to determining that the preferred API is available to use for the extraction group, the system assigns the preferred API to the extraction group (step 350). In response to determining that the preferred API is not available to use for the extraction group, the system assigns the default API to the extraction group (step 360).
In one embodiment, the default API for all groups is the REST API.
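The selection logic of steps 330 through 360 can be sketched as follows. The row thresholds and the REST default are those given in the embodiment above. The quota model (a daily call limit, the calls the customer already uses for other purposes, and a maximum share reserved for extraction), the `calls_needed` estimate, and all names are assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional

DEFAULT_API = "REST"  # default API for all groups in one embodiment


def preferred_api(total_rows: int) -> str:
    """Step 330: choose the preferred API from the total number of rows."""
    if total_rows < 50_000:
        return "REST"
    if total_rows <= 1_000_000:
        return "Bulk"
    return "Bulk PK Chunking"


@dataclass(frozen=True)
class ApiQuota:
    daily_limit: int         # calls allowed by the customer's plan (assumed unit)
    used_by_customer: int    # calls the customer consumes for other purposes
    max_backup_share: float  # fraction of the limit the customer allows for extraction

    def calls_available(self) -> int:
        cap = int(self.daily_limit * self.max_backup_share)
        free = self.daily_limit - self.used_by_customer
        return max(0, min(cap, free))


def assign_api(
    total_rows: int,
    calls_needed: int,
    quotas: Dict[str, ApiQuota],
    required_api: Optional[str] = None,
) -> str:
    if required_api is not None:  # the data type in scope requires a certain API
        return required_api
    candidate = preferred_api(total_rows)
    quota = quotas.get(candidate)
    if quota is not None and quota.calls_available() >= calls_needed:  # step 340
        return candidate  # step 350
    return DEFAULT_API    # step 360


quotas = {
    "Bulk": ApiQuota(daily_limit=10_000, used_by_customer=2_000, max_backup_share=0.5),
    "Bulk PK Chunking": ApiQuota(daily_limit=10_000, used_by_customer=9_900, max_backup_share=0.5),
}
```

With these hypothetical quotas, a group of 200,000 rows needing 100 calls is assigned the Bulk API, whereas a group of 5,000,000 rows needing 500 calls falls back to the default API because only 100 calls remain available for Bulk PK Chunking.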
In certain embodiments, assigning an API to an extraction group includes using a neural network to predict a preferred API assignment for the extraction group based on one or more characteristics of the data objects in the extraction group and the scope of data extraction for the extraction group. The neural network may be trained using labeled past extraction results. In certain embodiments, one of the characteristics is the total number of rows to be extracted in the extraction group.
The backup platform 435 hosts a backup for each entity 430a, 430b, 430c. Each backup 430a, 430b, 430c includes the metadata and extracted data records that correspond to the metadata and data records for each entity in the multi-tenant SaaS environment 450a, 450b, 450c. Neither the multi-tenant SaaS environment 470 nor the backup platform 435 provides a separate database for each entity. Hence, while the data records are illustrated separately using database symbols with respect to each entity in both the multi-tenant SaaS environment 470 and the backup platform 435, the database or databases are often shared between entities on their respective servers. A customer can make API calls to the backup 430a, 430b, 430c via API interface 480.
The methods described with respect to
As will be understood by those familiar with the art, the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Accordingly, the above disclosure is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.