Various embodiments concern computer programs and associated computer-implemented techniques for developing and implementing algorithms to facilitate the application of machine learning.
Machine learning is the study of computer algorithms (or simply “algorithms”) that can improve automatically through experience and the use of data. These algorithms generally build a machine learning model (or simply “model”) based on sample data—also called “training data”—in order to make predictions without being explicitly programmed to do so. These algorithms are used in a wide variety of applications, and the number of possible applications continues to expand.
A core objective of machine learning is to generalize from experience. Generalization in this context is the ability of a model to perform accurately on new examples after having learned through analysis of old examples included in training data. In order to improve performance, the old examples included in the training data generally come from some probability distribution that is considered representative of the possible occurrences. Ensuring that the old examples cover a large enough gamut of the possible occurrences is an important aspect of training, as it ensures that the model is sufficiently “flexible” to accept new examples that are different from the old examples.
In theory, introducing machine learning to address a new problem or situation is a rather straightforward concept. However, appropriately designing, training, and then implementing models (and more generally, machine learning applications that make use of models) tends to be difficult in practice.
Various features of the technology described herein will become more apparent to those skilled in the art from a study of the Detailed Description in conjunction with the drawings. While certain embodiments are depicted in the drawings for the purpose of illustration, those skilled in the art will recognize that alternative embodiments may be employed without departing from the principles of the technology. The technology is amenable to various modifications.
Machine learning has become one of the most critical capabilities for modern businesses to grow and stay competitive today. From automating internal processes to optimizing the designing, creating, and marketing processes behind many products, machine learning models (“ML models” or simply “models”) and machine learning applications (“ML applications” or simply “applications”) have permeated nearly every aspect of our work and personal lives.
Development of ML models and applications tends to be iterative and complex, made even harder because most of the necessary tools are not built for the entire machine learning lifecycle.
As shown in
The interfaces 106 may be accessible via a web browser, desktop application, mobile application, or another form of computer program. For example, in embodiments where the data platform 102 resides on a computer server (e.g., that is part of a server system 110), a user may interact with the data platform 102 through interfaces displayed on a desktop computer by a web browser. As another example, in embodiments where the data platform 102 resides—at least partially—on a personal computing device (e.g., a mobile phone, tablet computer, or laptop computer), a user may interact with the data platform 102 through interfaces displayed by a mobile application or desktop application. However, these computer programs may be representative of thin clients if most of the processing is performed external to the personal computing device (e.g., on a server system 110).
Generally, the data platform 102 is either executed by a cloud computing infrastructure operated by, for example, Amazon Web Services, Google Cloud Platform, Microsoft Azure, or another provider, or provided as software that can run on dedicated hardware nodes in a data center. For example, the data platform 102 may reside on a server system 110 that comprises one or more computer servers. These computer servers can include different types of data (e.g., associated with different users), algorithms for processing the data, trained and untrained models, and other assets. Those skilled in the art will recognize that this information could also be distributed among the server system 110 and one or more personal computing devices. For example, a model may be downloaded from the server system 110 to a personal computing device, such that the model can be trained or implemented on data that resides on the personal computing device. This “localized training” may be helpful in scenarios where privacy is important, as it not only limits the likelihood of unauthorized access (e.g., because the sensitive data is not transmitted external to the personal computing device) but also limits who has access to predictions output by the model.
As further discussed below, one aspect of the data platform 102 is its ability to support machine learning workspaces (or simply “workspaces”) in which users can develop, test, train, and ultimately deploy models for building predictive applications. An application may allow use of any data under management within the “data cloud” of the corresponding user. The data cloud (also called the “data lake”) could include data stored on public cloud infrastructure, private cloud infrastructure, or both. Accordingly, the data cloud could include data that is stored on, or accessible to, the server system 110 as well as any number of personal computing devices.
Note that the workspaces may be independently accessible and manipulable by the corresponding users. For example, the data platform 102 may support a first workspace that is accessible to a first set of one or more users and a second workspace that is accessible to a second set of one or more users, and any work within the first workspace may be entirely independent of work in the second workspace. Generally, the first and second sets of users are entirely distinct. For example, these users may be part of different companies. However, the first and second sets of users could overlap in some embodiments. For example, a company could assign a first set of data scientists to the first workspace and a second set of data scientists to the second workspace, and at least one data scientist could be included in the first set and second set. Similarly, a single user may instantiate or access multiple workspaces. These workspaces may be associated with different projects, for example.
To improve the ease with which new applications can be developed, a data platform (e.g., data platform 102 of
AMPs can provide reference machine learning projects that serve as examples indicating how the corresponding models can be extended to new problems, new users, and new data. More than simplified quick starts or tutorials, AMPs represent fully developed solutions to common problems in machine learning. These solutions demonstrate how to fully use the power of the data platform. Simply put, AMPs illustrate how users can utilize the data platform to solve their own use cases through the use of machine learning, without needing to have an in-depth knowledge of machine learning. For the purpose of illustration, AMPs may be described in the context of specific problems in machine learning. However, those skilled in the art will recognize that AMPs could be developed for various problems.
AMPs may be available to install and run from a user interface (or simply “interface”) that is generated by the data platform.
Assume, for example, that a user is interested in implementing an AMP shown in
One noteworthy use for AMPs is to showcase examples that are specific to a business or field by creating specialized AMPs. After a data science project has been built using the data platform, a user can package the data science project such that it can be added to the catalog of AMPs. In some embodiments, the data science project must be reviewed and approved by an administrator before its addition to the catalog of AMPs. The administrator may be associated with (e.g., employed by) an organization that operates the data platform. In other embodiments, the data science project is reviewed and approved by the data platform. For example, the data platform may autonomously review the data science project and its characteristics (e.g., model type, model goal, modeling algorithm, accuracy) and then determine whether one or more criteria are met. If the data science project meets the criteria, then the data platform may add the data science project to the catalog of AMPs.
Each AMP may require a separate metadata file, which can define the computing resources needed by the corresponding AMP, the setup steps for installing the corresponding AMP in a workspace, etc. Exemplary code for an AMP is provided below.
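The following is a minimal sketch of such a metadata file, expressed in YAML. The project name, version numbers, task type identifier, runtime details, and resource figures are hypothetical placeholders chosen for the purpose of illustration rather than values drawn from any particular AMP:

# Hypothetical .project-metadata.yaml for an AMP (illustrative values only)
name: Churn Prediction Prototype
description: >-
  Train and deploy a model that predicts customer churn, then
  expose the model through a predictive application.
specification_version: 1.0    # assumed version of the metadata format
prototype_version: 1.0        # assumed version of the AMP itself

runtimes:                     # dependent software versions
  - editor: Workbench
    kernel: Python 3.9
    edition: Standard

tasks:                        # code to execute, with compute resources
  - type: run_session         # assumed task type identifier
    name: Install dependencies
    script: setup/install_dependencies.py
    cpu: 1
    memory: 2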
Accordingly, the data structure that is representative of the AMP may include entries with respective data elements, such as a name, a description, version information, a list of runtimes including details on dependent software versions, a list of tasks to be performed by the AMP including details on code to execute, computational resources needed to implement the AMP, and the like.
At a high level, an AMP catalog (or simply “catalog”) is a collection of AMPs that can be added to a workspace as a group. Upon accessing the data platform for the first time, users may be permitted to access a default catalog that contains AMPs developed or approved by the organization that operates the data platform. However, users may also be able to create their own catalogs, adding AMPs developed by their respective organizations.
Assume, for example, that a user is interested in creating a catalog. In this scenario, the user may create a human-readable configuration file—called a “catalog file”—that can be hosted by an Internet hosting service such as GitHub, Inc. Specifically, the catalog file could be hosted on either a public server or a private server. The human-readable file may be created in a data-serialization language such as YAML or JSON. The catalog file can include information about each AMP in the corresponding catalog. Moreover, the catalog file can provide a link to the repository of each AMP. Thus, the catalog file can contain descriptive information in addition to metadata for displaying AMPs included in the corresponding catalog. Table I includes descriptions of fields that could be included in the catalog file.
For the purpose of illustration, exemplary code of a catalog file is provided below:
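The catalog name, entry fields, and repository URL below are hypothetical; they simply illustrate how descriptive information and display metadata might be combined in a single YAML document:

# Hypothetical catalog file (illustrative values only)
name: Example Organization Catalog
entries:
  - title: Churn Prediction Prototype
    label: churn-prediction
    short_description: Predict which customers are likely to churn.
    long_description: >-
      Demonstrates an end-to-end churn workflow, from data ingestion
      through model training to a predictive application.
    image_path: images/churn.jpg    # thumbnail displayed in the catalog
    tags:
      - Classification
      - Churn
    git_url: "https://github.com/example-org/churn-prediction-amp"
    is_prototype: true              # marks the entry as an AMP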
One benefit of this approach to maintaining catalog files is the ability to create editable copies of an original AMP catalog, each of which maintains its own distinct identity. These editable copies may be called “forks” of existing catalogs. For example, the data platform may maintain a default catalog that is available to all users. In order to host the default catalog internally (e.g., on a personal computing device maintained by a user, her organization, etc.), a fork of the default catalog can be created. In this scenario, the uniform resource locators (“URLs”) and metadata in the forked catalog can be updated by the data platform to point to the appropriate internal resources. Thus, the data platform may tailor the forked catalog to account for its instantiation on the personal computing device.
As mentioned above, data science projects involving AMPs may include, or be associated with, metadata files that provide configuration details, setup details, and the like. For example, these details may include environment variables, as well as tasks to be run on startup. In some embodiments, the metadata file is a YAML file that has a predetermined naming structure (e.g., .project-metadata.yaml). Moreover, the metadata file may need to be placed in a specific location, for example, the root directory of the data science project, for reference purposes.
Fields for the metadata file may generally be string fields. String fields are normally constrained to a maximum character length. For example, a string(64) field may contain at most 64 characters, while a string(200) field may contain at most 200 characters. Table II includes descriptions of fields that could be included in the metadata file.
The metadata file can optionally define any number of global environment variables for the data science project under the environment field. This field may be an object, containing keys representing the names of the environment variables and values representing details about those environment variables. Below is an example in which four environment variables are created:
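A sketch of such an environment field follows. The variable names, as well as the detail sub-fields (default, description, required), are assumptions made for the purpose of illustration:

environment:
  AWS_ACCESS_KEY:
    default: ""
    description: "Access key ID for the source data bucket"
    required: true
  AWS_SECRET_KEY:
    default: ""
    description: "Secret key for the source data bucket"
    required: true
  HYPERPARAMETER_TUNING:
    default: "off"
    description: "Whether to tune hyperparameters during training"
    required: false
  TRAINING_SAMPLE_FRACTION:
    default: "1.0"
    description: "Fraction of the training data to use"
    required: false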
AMPs might depend on some optional features of a workspace. The feature_dependencies field may accept a list of such features. Unsatisfied feature dependencies that are deemed mandatory may prevent the AMP from being launched in a workspace, and an appropriate error message may be displayed. As an example, certain model metrics may need to be defined or achieved in order for the AMP to be launched. Meanwhile, unsatisfied feature dependencies that are deemed optional may not prevent the AMP from being launched in a workspace, though the user may still be notified of the unsatisfied feature dependencies (e.g., with an appropriate warning message).
The engine_images field may accept a list of engine_image objects that are defined as follows:
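A plausible shape for these objects, sketched in YAML (the field names image_name and tags are assumptions, used consistently in the examples below):

engine_images:
  - image_name: string   # name of the engine image
    tags:                # optional list of acceptable version tags
      - string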
This example specifies the official engine image with version 11 or 12:
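A sketch, assuming the official engine image is named “engine”:

engine_images:
  - image_name: engine
    tags:
      - 11
      - 12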
Meanwhile, this example specifies the most recent version of the dataviz engine image in the workspace:
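A corresponding sketch, assuming the image is named “dataviz”; the tags field is omitted so that, per the note below, the most recent version is returned:

engine_images:
  - image_name: dataviz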
Note that when tags are not specified, the most recent version of the engine image with the matching name can be returned.
The runtimes field may accept a list of runtimes objects that are defined as follows:
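A plausible shape for these objects, sketched in YAML with assumed field names:

runtimes:
  - editor: string      # e.g., Workbench
    kernel: string      # e.g., Python 3.9
    edition: string     # e.g., Standard
    version: string     # optional; pins a specific runtime release
    addons:             # optional list of runtime add-ons
      - string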
The runtimes field can be defined on a per-task or per-project basis.
Meanwhile, the task list may define the tasks that can be automatically run on project import. Each task may be run sequentially in the order specified in the metadata file. Table III includes descriptions of fields that could be included in the task list.
There are various tasks that can be specified in the type field, including create job and run job.
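The sketch below pairs these two task types. The serialized type identifiers (create_job, run_job), entity labels, script path, and resource figures are assumptions made for the purpose of illustration:

tasks:
  - type: create_job            # define a job without running it
    name: Train churn model
    entity_label: train_model   # label used to reference this job later
    script: scripts/train.py
    cpu: 1
    memory: 4
    short_summary: Create the job that trains the model.
  - type: run_job               # execute the job defined above
    entity_label: train_model
    short_summary: Run the training job.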
As mentioned above, machine learning is still not fully accessible despite being impactful. Data science projects involving machine learning may not make it to production for many reasons, including limited expertise, inadequate tooling, lack of best practices, infrastructure issues, data issues, and the like. AMPs were developed to address the accessibility problem. Specifically, AMPs were designed to contribute the following:
For a machine learning use case to make it to production, several criteria must typically be met. First, the data has to be available at a scale and in a format that is appropriate for the use case in question. Second, data transformations, feature engineering, and model training have to be done to build a model. Third, models have to be made available to the applications that require them. Fourth, applications have to be built—properly utilizing the models—to serve specific outcomes.
AMPs target individual machine learning use cases—packaging up the data, data operations, model training, model serving, and applications that make up those use cases. After an AMP has been deployed by a data platform, all of the data and code that make up the AMP may be available within a data science project for the necessary work to incorporate user-specific data, as well as to enable further customization. Said another way, the data and code that make up the AMP may be readily manipulable or extendable to accommodate user-specific data that is provided as input. To facilitate implementation, AMPs may be available through a catalog as discussed above. The catalog can be updated as new AMPs are developed and made publicly available to users of the data platform. Moreover, users may be able to develop their own AMPs, for example, to reflect organizational best practices or address organizational specificities. Through the creation of a customized catalog, a user may be able to make these AMPs available to other users associated with the same organization. Over time, it is expected that the number of AMPs will continue to grow.
When implemented through the data platform, AMPs have two main components as shown in
Advantageously, the data platform may allow AMPs to be launched from a workspace in several ways. First, a user may launch an AMP from the catalog by selecting an “AMP tile,” clicking the digital element labeled “Launch as Project,” and then clicking the digital element labeled “Configure Project,” as shown in
Launching an AMP causes the data platform to perform several steps “under the hood.” Specifically, the data platform can clone the repository that corresponds to the AMP, check for a metadata file in the root of the repository, and then initiate an automatic execution of the steps specified in the metadata file to create the data, models, and applications necessary to recreate the data science project. Each step may correspond to a job, session, experiment, model endpoint, or application that is executable or implementable by the data platform.
After receiving input indicative of a confirmation of the parameters (e.g., a selection of the digital element labeled “Launch”), the data platform can initiate construction of the data science project. Said another way, the data platform can begin building the data, models, and applications needed for the data science project using the AMP assets maintained in the repository. As shown in
Thereafter, the data platform can configure an AMP that serves as a repository that includes code and information, if any, that is needed to programmatically reproduce another instance of the data science project in such a manner that the machine learning model is extendable to a different user or a different dataset (step 1202). At a high level, the data platform may genericize aspects of the data science project, so that its underlying mechanisms—namely, its code and machine learning model—can be applied in a different context. As mentioned above, in some embodiments, the data platform only configures the AMP in response to receiving approval to do so. Thus, the data platform may receive second input that is indicative of an approval, by an administrator, of the data science project, and the data platform may configure the AMP in response to receiving the second input.
The data platform can then add the AMP to a catalog by populating the repository into a data structure that corresponds to the catalog, so as to make the AMP accessible to another user for implementation as part of another data science project (step 1203). In some embodiments, the AMP is only made available to other users that are part of the same organization (e.g., company) as the user that developed the data science project. In other embodiments, the AMP is made available to all users of the data platform. As mentioned above, the catalog may include multiple AMPs that users are permitted to deploy. Each of the multiple AMPs may be associated with a different repository, and therefore the repository may be one of multiple repositories maintained in the data structure. In the data structure, each of the multiple AMPs may be accompanied by a metadata file that defines an operational characteristic of the corresponding AMP. For example, a metadata file may specify the computing resources needed by the corresponding AMP and/or setup steps for installing the corresponding AMP.
At some point thereafter, the data platform may receive second input that is indicative of a selection, by a second user, of the AMP from among the multiple AMPs (step 1204). For example, the second user may select the AMP through an interface such as the one shown in
Then, the data platform can initiate automatic execution of one or more steps specified in the metadata file to recreate the data science project in such a manner that the machine learning model is applicable to user-specific data (step 1304). For example, the data platform may cause digital presentation of the information included in the metadata file on an interface and, in response to receiving second input that is indicative of a confirmation, by the user, of the information, construct a new instance of the data science project using assets included in the copy of the repository. The assets could include the code and information that is needed to programmatically recreate the new instance of the data science project. Moreover, the data platform may determine whether alteration of the machine learning model is necessary for the new instance of the data science project to be suitable for analysis of the user-specific data. In the event that the data platform determines that an alteration of the machine learning model is necessary, the data platform can implement the alteration on behalf of the user. In some embodiments, the data platform keeps the user apprised of progress by causing digital presentation of an indicium that visually illustrates progression as the new instance of the data science project is being constructed.
The data platform can then cause the human-readable configuration file to be stored on a computer server (step 1405). This may result in the human-readable configuration file being accessible to other users who are members of the same group as the user. For example, the human-readable configuration file could be made available to all users of the data platform, or the human-readable configuration file could be made available to other users who are employees of the same organization as the user. In some embodiments, the computer server is a public computer server that is part of the same server system on which the data platform resides. In other embodiments, the computer server is a private computer server, for example, that is maintained by, or accessible to, an organization of which the user is an employee.
As mentioned above, this approach to creating a configuration file that includes references (e.g., links) to repositories of AMPs, or the repositories themselves, allows the catalog to be readily forked. The data platform may fork the configuration file, thereby creating a new catalog, in response to receiving input that is indicative of a selection, by a second user, of the same multiple AMPs from amongst the collection of AMPs. Similarly, the data platform may fork the configuration file, thereby creating a new catalog, in response to receiving input that is indicative of a selection, by the second user, of the catalog itself. Once forked, the new catalog may be readily editable, for example, by allowing the second user to add new AMPs thereto or delete existing AMPs therefrom.
The processing system 1500 may include a processor 1502, main memory 1506, non-volatile memory 1510, network adapter 1512, display mechanism 1518, input/output device 1520, control device 1522, drive unit 1524 including a storage medium 1526, or signal generation device 1530 that are communicatively connected to a bus 1516. Different combinations of these components may be present depending on the nature of the computing device in which the processing system 1500 resides. The bus 1516 is illustrated as an abstraction that represents one or more physical buses or point-to-point connections that are connected by appropriate bridges, adapters, or controllers. Thus, the bus 1516 can include a system bus, a Peripheral Component Interconnect (“PCI”) bus or PCI-Express bus, a HyperTransport or Industry Standard Architecture (“ISA”) bus, a Small Computer System Interface (“SCSI”) bus, a Universal Serial Bus (“USB”), an Inter-Integrated Circuit (“I2C”) bus, or an Institute of Electrical and Electronics Engineers (“IEEE”) standard 1394 bus (also called “FireWire”).
While the main memory 1506, non-volatile memory 1510, and storage medium 1526 are shown to be a single medium, the terms “machine-readable medium” and “storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database and associated caches and computer servers) that store one or more sets of instructions 1528. The terms “machine-readable medium” and “storage medium” shall also be taken to include any medium that is capable of storing, encoding, or carrying instructions for execution by the processing system 1500.
In general, the routines executed to implement embodiments of the present disclosure may be implemented as part of an operating system or a specific computer program. A computer program typically comprises instructions (e.g., instructions 1504, 1508, 1528) set at various times in various memory and storage devices in a computing device. When read and executed by the processor 1502, the instructions cause the processing system 1500 to perform operations in accordance with aspects of the present disclosure.
Further examples of machine- and computer-readable media include recordable-type media, such as volatile memory devices and non-volatile memory devices 1510, removable disks, hard disk drives, and optical disks (e.g., Compact Disk Read-Only Memory (“CD-ROMs”) and Digital Versatile Disks (“DVDs”)), and transmission-type media, such as digital and analog communication links.
The network adapter 1512 enables the processing system 1500 to mediate data in a network 1514 with an entity that is external to the processing system 1500 through any communication protocol supported by the processing system 1500 and the external entity. The network adapter 1512 can include a network adapter card, a wireless network interface card, a router, an access point, a wireless router, a switch, a multilayer switch, a protocol converter, a gateway, a bridge, a bridge router, a hub, a digital media receiver, a repeater, or any combination thereof.
The foregoing description of various embodiments of the claimed subject matter has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the claimed subject matter to the precise forms disclosed. Many modifications and variations will be apparent to one skilled in the art. Embodiments were chosen and described in order to best describe the principles of the invention and its practical applications, thereby enabling those skilled in the relevant art to understand the claimed subject matter, the various embodiments, and the various modifications that are suited to the particular uses contemplated.
Although the Detailed Description describes certain embodiments and the best mode contemplated, the technology can be practiced in many ways no matter how detailed the Detailed Description appears. Embodiments can vary considerably in their implementation details, while still being encompassed by the specification. Particular terminology used when describing certain features or aspects of various embodiments should not be taken to imply that the terminology is being redefined herein to be restricted to any specific characteristics, features, or aspects of the technology with which that terminology is associated. In general, the terms used in the following claims should not be construed to limit the technology to the specific embodiments disclosed in the specification, unless those terms are explicitly defined herein. Accordingly, the actual scope of the technology encompasses not only the disclosed embodiments, but also all equivalent ways of practicing or implementing the embodiments.
The language used in the specification has been principally selected for readability and instructional purposes. It may not have been selected to delineate or circumscribe the subject matter. It is therefore intended that the scope of the technology be limited not by this Detailed Description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of various embodiments is intended to be illustrative, but not limiting, of the scope of the technology as set forth in the following claims.
This application claims priority to U.S. Provisional Application No. 63/313,611, titled “Applied Machine Learning Prototypes (AMPs) for Hybrid Cloud Data Platform” and filed Feb. 24, 2022, which is incorporated by reference herein in its entirety.