The present disclosure relates to methods, techniques, and systems for annotation of data and, in particular, to methods, techniques, and systems for automatically annotating open data and generating a taxonomy for a set of datasets.
The word “taxonomy” has been traditionally associated with the classification of living organisms into different scientific classifications; however, in recent times, it has also been applied to the general process of sorting, classifying, or categorizing “things” into groups. Thus, other schemes of classification for things such as genomes, smells, computer systems, websites, political identities, and the like, have been developed over time. Taxonomies are typically formulated by brute force (by hand) and a priori, before reviewing the data that is to be classified. Thus, a taxonomy may organize the data displayed on a website, for example, by categorizing that data into groups.
As new types and sources of data become available, it becomes increasingly important to develop mechanisms that aid in characterizing that data automatically, or at least semi-automatically. Proper characterizations allow end-users to browse the data and discover information rather than search based upon known keywords.
Embodiments described herein are directed to tools that facilitate the use and sharing of “open data.” Open data as used herein refers to any data that anyone is free to use, reuse, and redistribute—subject at most to a requirement to attribute and share-alike (share the results of open data mixed with other data). This data is typically supplied as a “dataset,” which refers to the metadata that describes the information in the dataset as well as the data itself. The data may be arbitrary but is typically provided in binary code or a human-readable text file with a pre-defined format.
In particular, the environment or platform referred to herein allows end-users to browse open data via two major approaches: the curated, top-down approach and the bottom-up, organic search approach. The bottom-up, organic search approach operates using keywords to locate relevant datasets and is not discussed further herein. The curated (or semi-curated) top-down approach encourages and facilitates the generation of computer-driven tools that can guide end-users interested in browsing or pursuing specific topics or classifications. End-users using the curated or semi-curated approach may or may not know in advance what they are looking for—hence they may not be able to use a bottom-up approach that uses keywords the end-user is presumed to be aware of.
The topics/classifications used by the top-down approach may take the form of “annotations” (tags, concept associations, topic associations, classifications, and the like) associated with particular datasets. For example, a dataset may or may not be about “animals” or about “crime.” A dataset may be annotated, for example, by associating a tag representing the topic/classification with the metadata of the dataset, or by otherwise associating the tag with the dataset such as through an external mechanism such as a data repository that indexes datasets and their annotations (and potentially other useful information).
In order for the curated approach to operate meaningfully and effectively, each dataset needs to be tagged with the annotations that describe the data. However, there is no known taxonomy (e.g., categorization, grouping, classification, or the like) for describing datasets (of potentially arbitrary data) comparable to the biological classifications of living organisms such as plants. Moreover, definitions of categories/taxonomies of data may change as content from different sources with different underlying definitions and structure is incorporated into a corpus of data over time. For example, a “crime” from a police department that is part of the governance of a state may have a different meaning (e.g., set by statute) than a “crime” from a neighborhood community center (which may have a broader view of what constitutes a crime, for example).
In addition, a publisher that provides data to the platform (herein referred to as a data publisher or content publisher) may define internally what their data and initially provided annotations mean, typically by providing metadata or information with the data that describes the data such as a “name” and “description.” Even so, customers of the data (end-users) may not ascribe the same definitions to these annotations or data descriptions. Opening up data (for use as “open data”) means that the top-down approach has to be disentangled from the content publishers (and content creators, if different) to solve the overall problem of providing meaningful global annotations for the search and discovery of content by the end-user.
Accordingly, in order to provide global annotations that are resilient for all or most datasets and that expand over time to fit new datasets as the system grows (with potentially a limitless number of datasets), it becomes desirable to provide an architecture for discovering annotations, building predictive models that can be used to automatically apply these annotations to datasets (new or old) with a desired level of precision, and scaling out the models as new models are developed and new annotations are added to the system.
Embodiments described herein provide enhanced computer- and network-based methods, techniques, and systems for providing a scalable architecture to annotate datasets. Example embodiments provide a Scalable Annotation System (“SAS”), which enables data platforms, for example, open data platforms (and other environments) to generate (e.g., develop, create, design, or the like) predictive models for annotating data in a cost efficient manner using an iterative process. The iterative process makes an educated “guess” on training data for a model for an annotation (e.g., a tag, topic, classification, grouping, etc.) and improves that model over time (over multiple iterations) to a tunable precision by sending select subsets of the data to one or more crowdsourcing venues (or human labeling by other means) for verification. This process is repeated in a “feedback” loop until the desired precision of the training/test data for the model is reached. For example, a model for the annotation “animal” is going to have a set of discriminating words, phrases, or descriptors (features) that, if present in the dataset, are likely to indicate that the dataset is about an animal and if not present in the dataset are likely to indicate that the dataset is not about an animal. The model is then used to annotate all of the datasets. In one embodiment of the SAS, the generated predictive models are Support Vector Machines, which are non-probabilistic binary classifiers. However, other machine learning techniques may be similarly incorporated or substituted such as, but not limited to, naïve Bayesian networks, neural networks, decision trees, and the like.
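As a rough illustration of the per-annotation binary classification described above, the following is a minimal sketch assuming scikit-learn is used; the dataset texts, labels, and the choice of TF-IDF features are illustrative assumptions rather than the platform's actual implementation.

```python
# Minimal sketch of a per-annotation binary classifier, assuming scikit-learn.
# The dataset texts and labels below are hypothetical placeholders.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Each training example is the text of a dataset's metadata (name, description,
# column names, etc.); labels are 1 if the "animal" annotation applies, else 0.
train_texts = [
    "Pet licenses: dog and cat license counts by zip code",
    "Animal shelter intake and outcome records",
    "City budget expenditures by department and fiscal year",
    "Road traffic volumes on arterial streets",
]
train_labels = [1, 1, 0, 0]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

# Apply the model to the remaining (unlabeled) datasets in the corpus.
candidates = ["Wildlife sightings reported in city parks"]
print(model.predict(candidates))  # 1 => annotate with "animal", 0 => do not
```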
One of the advantages of the SAS is that each derived model is independent. Thus, it is possible to create and apply more than one model per annotation, and different models for different annotations, and to apply them to the datasets without concern for the models interfering with each other. In addition, models that are not performing well may be discarded without breaking the system, and new models, for example for new annotations, may be added at any time. This provides a degree of scalability to the system as the datasets grow over time and a taxonomy of possible annotations is developed.
As mentioned, during the feedback loop used to create a model for an annotation, a subset of the datasets (a subset of test/training data) is sent to a crowdsourcing venue for verification and to improve the accuracy of the training data. Here a crowdsourcing venue refers to any organization, institution, application, or API that supports the performance of (typically small) divided tasks to be performed by humans via, for example, an open call or invitation for work. For example, crowdsourcing has been used to identify and group images or to look for a particular image (e.g., a needle-in-a-haystack problem). AMAZON MECHANICAL TURK is one example of such a program, although others may be incorporated. In the case of the SAS, the crowdsourcing venue is sent a “survey” (e.g., questionnaire, work request) of one or more questions to be answered for one or more datasets, where each question refers to a (row of a) dataset and the dataset's metadata and requests the recipient to indicate whether the annotation describes that data. For example, the survey may ask a question such as: is this (row of the) dataset (and metadata presented) about animals? Surveys are described further below.
Several non-trivial issues confront the SAS in the development of predictive models for annotating the datasets: namely,
1. No training data exists a priori for a given annotation so this initial data must be developed;
2. The set of annotations is unknown and may (likely) evolve over time; and
3. Money is a finite resource.
If the third premise were not true, then arguably every dataset could be sent en masse to a crowdsourcing venue and individually annotated by humans. However, as explained further below, using crowdsourcing (or other human labeling means) to hand label all datasets for a platform would be cost prohibitive for most platforms including the SAS.
Thus, to address these issues, the SAS initially creates training data using a seeding algorithm and builds an initial “ansatz” (“guess”) predictive model, which is augmented over time with subsets of human-verified data to improve accuracy.
In particular, in block 101, a desired annotation is input into the system (e.g., “animal”) and an indication of a termination condition is expressed. For example, a termination condition may be expressed as the number of desired positive and negative dataset examples (of the annotation), a desired or maximum budget to spend on crowdsourcing augmentation, some mixture of both, or some other termination condition. Blocks 102 through 107 are then repeated iteratively to collect/create test (positive and/or negative) datasets for the model until this termination condition is reached. Specifically, block 102 uses a seed word or phrase, which is presumed to be predictive of the annotation, to create training data with a certain (settable) number of positive examples and negative examples. Typically, this seed word or phrase is a keyword known to yield positive results. For example, if the desired annotation is “animal,” then the datasets are initially searched for a seed word such as “dog,” “cat,” or “animal shelter.” In some example annotations training data collection system (ATDCS) implementations, the initial keywords are human provided (socially provided) tags; in other examples, the initial keyword is machine provided. If the seed word/phrase is found, then the dataset is deemed a positive dataset. If the seed word is not found and insufficient positive datasets have been located in this seed round, then the logic endeavors to find other keywords that result in positive dataset examples (for example, keywords immediately connected to the current keyword in a taxonomy of keywords known thus far to the system). At some point, the logic determines a set of keywords that can be used to determine negative examples as well. Examples of specific logic for making these determinations are described further below.
The seed-labeled datasets are stored by logic 107 in a data repository for storing the training dataset information, such as a model database (a database containing information useful to the models being built by the system). In block 103, the logic creates an (initial) ansatz model using the current training dataset information, and in block 104 uses this ansatz model to predict whether the annotation applies to all of the remaining datasets. (These results can also be stored in the model database.) Recall that the ansatz model, in one embodiment, is also an SVM (Support Vector Machine), which can classify all of the remaining datasets even though it truly is a “guess”—some results will be correct and others not. These processes are described further below in the “Ansatz/Seed Model” section. Block 105 samples output (some datasets) from those annotated by the ansatz model and sends representations of these datasets off (as surveys) in block 106 to be hand labeled, such as by a crowdsourcing venue or other hand-labeling process. The sampling process and crowdsourcing survey process are described further below in the “CrowdSourcing” section. The results of the human-labeled data are stored by logic 107 in the data repository for storing the training dataset information (e.g., the model database). Specifically, in one example SAS, the human-labeled data stored by logic 107 are “preferred” over the seeded results and thus supplemented by data derived from the ansatz model to reach the desired number of positive and negative examples.
The logic then returns to the beginning of the loop in block 102 if the termination condition is not yet met (e.g., not enough positive and/or negative dataset examples with sufficient precision); otherwise, in block 108, the stored and labeled training datasets are used to construct a predictive model for a particular annotation, which is used to annotate all of the datasets in the corpus. More specifically, feature extraction is applied to each remaining dataset and the constructed predictive model is applied to the extracted features of each remaining dataset to determine whether that particular remaining dataset should be annotated with the particular annotation that is the subject of the constructed model.
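The overall feedback loop (blocks 101 through 108) can be summarized with the following structural sketch; the helper callables (seed_examples, build_ansatz, and so on) are hypothetical placeholders for the seeding, modeling, sampling, crowdsourcing, and storage steps described above, not a prescribed implementation.

```python
def collect_training_data(annotation, seed, corpus, seed_examples, build_ansatz,
                          predict_all, sample_for_review, crowdsource, store,
                          target_pos=100, target_neg=100, max_iterations=10):
    """Structural sketch of blocks 101-108; the callables are assumed helpers."""
    crowd_labels = {}   # dataset id -> True/False, human verified (preferred)
    seeded = {}         # dataset id -> True/False, from the seeding algorithm
    for _ in range(max_iterations):
        pos = sum(1 for v in crowd_labels.values() if v)
        neg = len(crowd_labels) - pos
        if pos >= target_pos and neg >= target_neg:
            break                                             # termination condition
        # Block 102: seed-derived labels top up whatever the crowd has not covered.
        seeded = seed_examples(seed, corpus, target_pos - pos, target_neg - neg)
        training = {**seeded, **crowd_labels}                 # crowd labels win on conflict
        model = build_ansatz(training)                        # block 103: ansatz model
        scores = predict_all(model, corpus)                   # block 104: score every dataset
        sample = sample_for_review(scores, training)          # block 105: reactive sampling
        crowd_labels.update(crowdsource(annotation, sample))  # block 106: surveys
        store(annotation, training, crowd_labels)             # block 107: model database
    return build_ansatz({**seeded, **crowd_labels})           # block 108: final model
```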
In one example SAS, annotating a dataset may comprise associating an annotation, for example a word or phrase tag, with the metadata for a dataset in a data repository. Other example SASes may comprise other methods for annotating the dataset. For example, an index may be created that cross-references annotations to dataset descriptors. Annotations are typically in the form of words or phrases and may also contain any alphanumeric sequence. Also, other forms of annotations may be contemplated, such as annotations that contain audio, video, or static image data instead of or in addition to words or phrase tags. Other implementations are similarly contemplated.
The logic of the annotations training data collection system (ATDCS) of the SAS is also shown as pseudo-code in Table 1 below. This logic is similar to the block-level logic described above.
For most platforms, cost is a significant issue. Therefore, it is desirable to automate some process for finding training data because it is not practical or possible to have humans hand-annotate every dataset—it is too costly and time consuming. When a dataset is loaded into the platform, there is typically some, but not necessarily a lot of, metadata used to describe the individual dataset, and the metadata may be severely incomplete or “incorrect” from a global (across all datasets) perspective. The types of datasets that are loaded also may lend themselves to many different topics/categories and potential annotations. Thus, it becomes imperative to allocate money wisely when the number of datasets reaches the tens of thousands with potentially hundreds (or more) of annotations.
Industry jargon evolves, as does the dataset content, as the number and types of datasets available on the platform change; hence, what different annotations mean also changes over time. There may not initially be enough information to annotate all datasets (e.g., some categories do not apply to all of the datasets and some datasets may not yet be associated with categories that have been “discovered” as part of the taxonomy being developed). Still, enough features and predictive model(s) need to be made available in order to automatically suggest annotations when confronted with a new dataset.
Hypothetically, the simplest path for the SAS to gather data to train a new model would be to send all (or some large random sample) of the metadata to a crowdsourcing venue. However, to send all of the datasets to the crowdsourcing venue would be cost prohibitive for most platforms, as shall be seen from the computations below.
For simplicity of example, assume that a survey (for a single dataset) has one question, that is, whether a particular annotation applies, and that this question can be answered by “yes,” “no,” “unsure.” In this example, each question applies to a single annotation—thus there is a survey per annotation per dataset.
Further suppose that the SAS desires 100 datasets as positive training examples and 100 datasets as negative training examples for a model for the annotation “animal.” Also assume for the sake of discussion that it may take at least 500 surveys (using the assignment of 1 annotation per dataset in a survey) to find the needed training examples as long as there are not a lot of “unsure” answers selected. Assume that 7 people respond to take these (500) surveys and that the cost from a crowdsourcing venue is $0.025 per survey. The cost of obtaining training data for a single model for a single annotation can then be expressed as:
Cost for training data = (# people responding) * (# surveys) * (cost per survey)   (1)
or 7*500*0.025=$87.50 per model (per iteration). Based upon average statistics of running the SAS over a period of time, a minimum of 7 people is typically needed (to determine a yes/no answer) with an average of 10-18 people—or $125-$225 per iteration for building a model for a single annotation. If it takes 10,000 iterations to collect training data for 1000 annotations, the cost balloons to $90,000 for 18 people.
Now suppose instead that it takes 10,000 datasets (e.g., all of the datasets are sent to the crowdsourcing venue) to obtain training data (positive and negative examples) for a model to assign the single category “Politics and Government” to a dataset; then the cost balloons to 10,000*0.025*18=$4,500 per annotation. If the system is assigning 1000 annotations (10,000 datasets*1000 annotations), the cost balloons to $4,500,000. Thus, it is clearly more cost effective to limit the number of surveys to be sent to a crowdsourcing venue rather than have the crowd evaluate each annotation for each dataset.
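For reference, the arithmetic above can be reproduced directly from equation (1); the figures below are the assumed ones from this example (7 or 18 responders, $0.025 per survey), not measured costs.

```python
# Reproducing the cost arithmetic of equation (1); the inputs are the assumed
# example figures from the text, not measured values.
def crowdsourcing_cost(people, surveys, cost_per_survey=0.025):
    return people * surveys * cost_per_survey

print(crowdsourcing_cost(7, 500))             # 87.5   -> ~$87.50 per model iteration
print(crowdsourcing_cost(18, 500))            # 225.0  -> upper end per iteration
print(crowdsourcing_cost(18, 10_000))         # 4500.0 -> one annotation, full corpus
print(crowdsourcing_cost(18, 10_000) * 1000)  # 4,500,000 -> 1000 annotations
```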
Some control of cost can be obtained as well by including the assignment of more than one annotation in a single survey.
As described above, terminating conditions may be expressed in many different forms. For example, a terminating condition may be expressed as the number of positive dataset examples desired, the number of negative dataset examples desired, or both. Alternatively or additionally, the terminating condition may be expressed as a budget that is spendable on the crowdsourcing venue, and when this budget is met or exceeded, the ATDCS terminates its attempt to discover a model for that particular annotation. As another example, the terminating condition may be expressed as the number of iterations of the feedback loop before declaring that the loop is unable to build a model for that annotation based upon the given seed. Other terminating conditions can be incorporated as well.
In order to start the iterative process of formulating training data for building a predictive model (for a designated annotation), a seed word, multiple words, or phrase is needed that is a strong predictor of the designated annotation. A seed algorithm is then used to find positive examples of datasets that can be categorized with the designated annotation and negative examples of datasets that should not be categorized with the designated annotation. Many different seeding algorithms can be used for this purpose, including a breadth first search of an index of words contained in the corpus starting from an initial keyword/phrase (the seed), potentially picked by a user.
As another example, if the annotation is “politics,” the seed word that can be used as a strong predictor of whether or not to apply the annotation is the word “politics.” The seeding algorithm can then perform an initial search to find datasets with the word “politics” as a feature of the dataset. When the seeding algorithm reaches the number of positive and negative datasets it desires to find, it can simply terminate. In one embodiment, the default number of positive and negative dataset examples are 100 each and the algorithm looks for negative examples after the positive ones are found. Other embodiments may have other defaults and/or the number may be a selectable or tunable parameter.
Other techniques for determining an initial set of positive examples of a designated annotation and negative examples of a designated annotation may be used. For example, the ATDCS itself may run an algorithm on the entire corpus, such as a Latent Dirichlet Allocation, to initially designate annotations for the set of datasets. Alternatively, the ATDCS may run an indexed search on the datasets of a portion or all of the corpus, using for example a search engine like Lucene, to generate an index of words (a keyword graph that relates keywords to other keywords), and then choose, using a “k-nearest-neighbor-like” approach, how far removed a word may be from the desired seed before the annotation is considered not to apply. For example, the ATDCS can perform a breadth-first search of the index to determine (as an example) that all keywords immediately connected to a chosen seed keyword (connected by a single edge) are to be used to search for positive examples of the seed. In addition, the ATDCS can then determine (as an example) that all keywords separated from the seed keyword and its immediately connected keywords by at least one level of indirection are to be used to search for negative examples of the seed word. (Other degrees of separation may be used.) Although the results of whether the dataset should be annotated with the designated annotation may not be accurate, the feedback loop will cause these dataset assignments to self-correct by nature of the crowdsourced supplementation.
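A minimal sketch of this breadth-first selection over a keyword graph follows; the adjacency structure, distance cutoffs, and keywords are hypothetical and stand in for an index generated by a search engine such as Lucene.

```python
# Sketch of breadth-first seeding over a keyword graph; the graph here is a
# simple adjacency dict and the keywords are hypothetical.
from collections import deque

def keywords_by_distance(graph, seed, max_depth=2):
    """Return {keyword: number of edges from the seed} via breadth-first search."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        word = queue.popleft()
        if dist[word] == max_depth:
            continue
        for neighbor in graph.get(word, []):
            if neighbor not in dist:
                dist[neighbor] = dist[word] + 1
                queue.append(neighbor)
    return dist

# Hypothetical keyword graph relating keywords that co-occur in dataset metadata.
graph = {
    "animal": ["dog", "cat", "animal shelter"],
    "dog": ["animal", "license"],
    "license": ["dog", "business"],
    "business": ["license", "tax"],
}
dist = keywords_by_distance(graph, "animal")
positive_terms = [w for w, d in dist.items() if d <= 1]  # search these for positives
negative_terms = [w for w, d in dist.items() if d >= 2]  # and these for negatives
print(positive_terms, negative_terms)
```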
Notice that these are mere assumptions used to create an initial ansatz model using a seeding algorithm—they may not be correct. For example, a dataset containing the phrase “animal abuse statistics” may eventually be considered by the feedback loop to in fact be a dataset that should be annotated with the category “animal” even though the initial seeding algorithm did not find this to be so. Choosing the right distance (number of edges) from the topic keyword as discriminating between a positive and negative example may influence how precise the initial model created from the seed is; however, since the ATDCS iterates and refreshes the training datasets with positive examples that are human verified, this initial distance has less permanent effect than might otherwise be the case.
Specifically, the example seeding algorithm described (e.g., the breadth-first algorithm) maximizes exposure to discriminating features for analysis. The positive examples are taken as those closest to the seed feature; negative examples start when the positive examples are complete. The feedback mechanism of the crowdsourcing naturally adjusts the starting points for where to look for positive and negative examples because the seed features are known when the ATDCS goes back to begin a subsequent iteration of the loop (block 102) after crowdsourcing has been used to validate some positive training examples. In this case, the training data is biased to include the validated examples, and the initial seed-based examples fill in the remaining number of desired examples. (They are stored in the model database for future reference.) For example, if the first pass through crowdsourcing yielded 40 positive examples and 60 negative examples, the seeding algorithm performed by the ATDCS in the next round will only use 60 positive examples and 40 negative examples by default. So, with subsequent iterations, fewer seed-based datasets are selected.
Once the initial positive and negative dataset examples are selected, the ATDCS creates an “ansatz” predictive model (e.g., in block 103) using the current training dataset information.
Of note, when creating a “model” for an annotation, strong predictive power can be realized by finding features (words, phrases, descriptors, and the like) that overlap. Features can be said to overlap when they co-occur when a particular topic is present. That is, if all datasets that can be annotated with the topic “animal” always also contained the feature “dog,” then co-occurrence of the words “dog” and “animal” would be highly predictive for classifying future datasets. It is also the case, however, that having too many predictive features makes prediction of classification more difficult. In that case it may be difficult to determine whether a dataset that contains many, but not all, of the overlapping features for topic “A,” which are also many, but not all, of the overlapping features for topic “B,” should be classified as an “A” or a “B.” Accordingly, it is important to have sufficient features for their discriminating power, yet not so many as to cause too much overlap.
One way to solve this issue is to employ a principal component analysis to reduce features so that only the more important features are used as predictive features. Some machine learning techniques, as here, implement a principal component analysis implicitly as part of their implementation. A separate principal component analysis may also be incorporated. The end result is a good set of distinguishing features that allow strong discriminants for determining whether a model applies to a given dataset.
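Where an explicit reduction step is wanted, one option is sketched below using scikit-learn; truncated SVD is used here as a PCA-like reduction that operates on sparse TF-IDF features, and the texts, labels, and component count are illustrative assumptions.

```python
# Sketch of an explicit feature-reduction step before classification, assuming
# scikit-learn; TruncatedSVD acts as a PCA-like reduction over sparse TF-IDF
# features. The texts and labels below are hypothetical.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

texts = ["dog licenses by zip code", "animal shelter outcomes",
         "street sweeping schedule", "traffic collisions by intersection"]
labels = [1, 1, 0, 0]

model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),   # keep only the strongest feature directions
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["cat adoption records"]))
```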
After the ansatz model has been applied to the corpus, the ATDCS determines a sample of datasets that are to be verified by crowdsourcing (e.g., in block 105).
In order to send dataset (and corresponding metadata) samples to the crowdsourcing venue, a survey is created for the selected sample of datasets. There are many methods for formulating such surveys, some by automated methods as described further below.
In addition to generating a survey for an annotation, the ATDCS determines which sample datasets (and metadata) to send for crowdsourcing validation. This sampling process may be performed in a variety of manners. Because the validation and collection of training data is an iterative process, it is possible to sample the datasets in a “biased” manner in order to elicit principal words (features) indicative of the designated annotation. One such sampling method, referred to as “reactive sampling,” is used herein although other sampling methods can be similarly incorporated.
Specifically, with reactive sampling, depending upon whether more positive or negative example datasets are needed, the ATDCS can choose datasets (as samples for crowdsourcing) that have been annotated using the ansatz predictive model with probabilities that indicate that they are “clearly in” (likely positive examples) or that indicate that they have a high co-occurrence of multiple words (features) and thus may be good discriminators once hand labeled by crowdsourcers to designate whether the annotation applies or does not apply (maybe positive examples).
More specifically, when using the predictive ansatz model (in block 104), each remaining dataset is assigned a probability indicating how likely it is that the designated annotation applies.
In one example embodiment, the sampling logic of the ATDCS (the “sampler”) reads from the model database and determines where it is missing information. For example, if the ATDCS needs more positive examples (as may be evident from the terminating conditions), the sampler selects datasets whose probabilities output by the ansatz model are above the threshold for a yes/no decision (those well above 50%). In contrast, if the ATDCS determines that the model database has more positive examples than it needs (as may be evident from the terminating conditions) and not enough negative examples, the sampler looks more closely at the datasets with probabilities output by the ansatz model that are closer to the threshold for a yes/no decision (for example, at the 50% probability level). By looking at the threshold, the sampler is finding datasets that have the highest co-occurrence of features and is depending upon the crowdsourcing to help differentiate these features.
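The sampler's decision can be sketched as follows; the probability values, sample size, and 0.5 threshold are illustrative and follow the description above rather than a prescribed implementation.

```python
# Sketch of reactive sampling over ansatz-model probabilities; all dataset
# identifiers and probability values are hypothetical.
def reactive_sample(probabilities, need_positives, sample_size=20, threshold=0.5):
    """probabilities: {dataset_id: P(annotation applies)} from the ansatz model."""
    if need_positives:
        # Far above the threshold: "clearly in" candidates, likely positives.
        ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    else:
        # Closest to the threshold: high feature co-occurrence, hardest to call,
        # so human labels here are the most informative for finding negatives.
        ranked = sorted(probabilities, key=lambda d: abs(probabilities[d] - threshold))
    return ranked[:sample_size]

probs = {"ds-1": 0.97, "ds-2": 0.51, "ds-3": 0.08, "ds-4": 0.62}
print(reactive_sample(probs, need_positives=True, sample_size=2))   # ['ds-1', 'ds-4']
print(reactive_sample(probs, need_positives=False, sample_size=2))  # ['ds-2', 'ds-4']
```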
Once the samples have been chosen, an automated survey can be generated (for the sampled datasets) and sent to the crowdsourcing venue, for example, using available APIs (application programming interfaces). The ATDCS will indicate how many recipients are required, with 7 being used as a default in one example ATDCS. There are multiple styles for the crowdsourcing survey question that can be chosen by the ATDCS user/administrator (as a configurable parameter), as described further below.
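One possible shape for the automated survey generation is sketched below; the question template wording and the submit_to_venue() call are hypothetical stand-ins rather than a particular crowdsourcing vendor's API.

```python
# Sketch of automated survey generation from a question template; the template
# text and submit_to_venue() are illustrative placeholders.
QUESTION_TEMPLATE = (
    "Is this dataset about {annotation}?\n"
    "Name: {name}\nDescription: {description}\nSample row: {sample_row}\n"
    "Answer: yes / no / unsure"
)

def build_surveys(annotation, sampled_datasets, recipients=7):
    surveys = []
    for ds in sampled_datasets:
        surveys.append({
            "dataset_id": ds["id"],
            "question": QUESTION_TEMPLATE.format(
                annotation=annotation,
                name=ds["name"],
                description=ds["description"],
                sample_row=ds["sample_row"],
            ),
            "recipients": recipients,   # default of 7 responders per survey
        })
    return surveys

# submit_to_venue(build_surveys("animal", sampled_datasets))  # venue-specific API call
```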
The number of survey recipients is also configurable. A default value of 7 survey recipients was chosen in one example ATDCS because it leads to a clear outcome when the outcome is obvious, and is somewhat robust to systematic bias, like crowdsource users just clicking answers at random. It is large enough to capture potential disagreements among those surveyed, but not so large as to expend resources unnecessarily. The example ATDCS uses standard binomial confidence intervals to decide which surveys have obvious yes/no answers. Binomial confidence intervals are discussed in C. J. Clopper and E. S. Pearson, “The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial,” Biometrika 26:404-413, 1934, which is incorporated herein by reference.
The default crowdsourcing survey question (automatically generated using the template) is optimized to search for a binary result for a desired set of annotations. When the results of a surveyed dataset do not yield a clear binary result with a sufficient confidence interval, then this survey is sent back to the crowdsourcing venue to extend the number of recipients answering the survey. For example, the number may be increased from 7 to 17 to gain accuracy in the results for a surveyed dataset.
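For illustration, the Clopper-Pearson (exact binomial) interval referenced above can be computed as follows, assuming SciPy is available; a conventional 95% confidence level is assumed here, and the 7- and 17-recipient figures mirror the discussion in the text.

```python
# Sketch of the Clopper-Pearson (exact binomial) interval used to decide
# whether a survey already has an obvious yes/no answer; assumes SciPy.
from scipy.stats import beta

def clopper_pearson(yes, n, alpha=0.05):
    lower = 0.0 if yes == 0 else beta.ppf(alpha / 2, yes, n - yes + 1)
    upper = 1.0 if yes == n else beta.ppf(1 - alpha / 2, yes + 1, n - yes)
    return lower, upper

# 7 of 7 responders said "yes": the interval stays well above 0.5 -> clear yes.
print(clopper_pearson(7, 7))
# 4 of 7 said "yes": the interval straddles 0.5 -> extend the survey (e.g., to 17).
print(clopper_pearson(4, 7))
```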
Thus, in the bulk of cases in this initial pass, the ATDCS can easily discern surveys where the answer is simply yes or no. The red oval 403 indicates a set of the survey questions (for datasets) where there is no clear yes/no answer, around the center of the plot where the bands are above or close to 0.5. In one example ATDCS, these survey questions are sent back to the crowdsourcing venue to extend the number of survey recipients for these surveys to elicit a stronger yes/no response. The result of sending these survey questions to a larger crowdsourcing audience is illustrated in the accompanying figures.
These statistical methods can be incorporated into the ATDCS to minimize cost without a loss of accuracy by minimizing necessary exposure to crowdsourcing. Had a larger sample size been used at the beginning, the cost would be approximately doubled. In addition, most annotations apply to less than 1% of the total data—thus using a brute force approach likely would be wasteful as much of the information gained would not apply.
As described above, once the training data has been collected, a predictive model is constructed for the designated annotation and used to annotate the datasets in the corpus.
In one example ATDCS, this is performed by building a Support Vector Machine (SVM) model. The SVM is a machine learning program that can be stored in a serialized data format, for example in a standard file, and loaded when it is to be executed. The training set data to be used as classifiers for the machine learning program has already been stored in the model database as a result of the previous training data generation. When a new dataset is to be annotated, the (SVM) model is loaded and executed using its training data as classifiers to determine whether or not to annotate the new dataset with a particular annotation.
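A minimal sketch of this store-and-reload flow, assuming scikit-learn and joblib for serialization; the file name, texts, and labels are hypothetical.

```python
# Sketch of persisting a trained per-annotation model and reloading it to
# annotate a newly arrived dataset; the file name and data are hypothetical.
import joblib
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["animal shelter intakes", "dog bite reports", "pothole repair requests"]
labels = [1, 1, 0]
model = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, labels)

joblib.dump(model, "annotation_animal.joblib")    # serialize to a standard file

loaded = joblib.load("annotation_animal.joblib")  # load when a new dataset arrives
print(loaded.predict(["stray animal pickups by neighborhood"]))
```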
In the example Scalable Annotation System described herein, there is at least one predictive model generated for each annotation. In addition, there may be more than one model (with its own training data/classifiers) per annotation. Some of these models may turn out to be better predictors than others. Thus, the system of models can be scaled (and trimmed) as the datasets and annotations grow.
Usually, when forming a hierarchy of models, the best one would be chosen according to a decision tree. Here, the SAS instead uses an array of models, each with a binary outcome, and applies the annotations to the datasets independently per model, where each model has its own features and its own test/train set for creation. A model manager receives the data (a dataset) for annotation, feeds the input into each model separately, and returns indicators for the models that returned a “yes” when asked the binary question, “is this dataset about X?” A list of annotations with p-values (probability values) is returned that indicates how likely each annotation in the list is to be accurate.
Using this procedure, any model can be injected into the system and that model's performance will not affect the performance of any other annotation model because each is built and run separately. For example, if a new model is created about “flying monkeys,” its performance in production will not affect models already in use to create automated annotations for “politics,” “evil badgers,” and “food trucks” or even other models already in use to create automated annotations for “flying monkeys.”
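The model-manager behavior described above might be sketched as follows; the registry structure and the mapping from a model's decision margin to a probability-like score are illustrative assumptions, not the SAS's prescribed implementation.

```python
# Sketch of a model manager that runs every annotation model independently and
# returns the annotations whose models answer "yes"; the margin-to-probability
# mapping below is an illustrative assumption.
import math

class ModelManager:
    def __init__(self):
        self.models = {}          # annotation -> independently built model

    def register(self, annotation, model):
        self.models[annotation] = model     # adding a model never affects the others

    def annotate(self, dataset_text):
        results = []
        for annotation, model in self.models.items():
            score = model.decision_function([dataset_text])[0]
            p = 1.0 / (1.0 + math.exp(-score))   # crude confidence from the margin
            if p > 0.5:                          # binary "is this dataset about X?"
                results.append((annotation, p))
        return sorted(results, key=lambda r: r[1], reverse=True)
```

Because each model is registered and evaluated independently, removing or adding a model changes only its own entry in the registry, which mirrors the injection property described above.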
As more data is gathered, the SAS can refine and version models independently so that any underlying API is not affected. In addition, issues regarding model boosting, multiple annotations, or customizing categories can be addressed separately while the models are in production.
In the examples described here, the ATDCS 511 further comprises logic 512, which operates within the ATDCS 511 to generate predictive models for each desired annotation according to the block diagram described above.
Each of the components of the SAS may be implemented using one or more computing systems or environments, as described below with respect to computing system 600.
Although the techniques of SAS and the ATDCS are applicable to open data, they are generally applicable to any type of data, especially large amounts accessible over a network. In addition, the concepts and techniques described are applicable to other annotation needs where the annotations are not known a priori. Essentially, the concepts and techniques described are applicable to annotating any large corpus of data.
Also, although certain terms are used primarily herein, other terms could be used interchangeably to yield equivalent embodiments and examples. In addition, terms may have alternate spellings which may or may not be explicitly mentioned, and all such variations of terms are intended to be included.
In addition, in the description contained herein, numerous specific details are set forth, such as data formats and code sequences, etc., in order to provide a thorough understanding of the described techniques. The embodiments described also can be practiced without some of the specific details described herein, or with other specific details, such as changes with respect to the ordering of the logic, different logic, etc. Thus, the scope of the techniques and/or functions described are not limited by the particular order, selection, or decomposition of aspects described with reference to any particular routine, module, component, and the like.
The computing system 600 may comprise one or more server and/or client computing systems and may span distributed locations. In addition, each block shown may represent one or more such blocks as appropriate to a specific embodiment or may be combined with other blocks. Moreover, the various blocks of the SAS 610 may physically reside on one or more machines, which use standard (e.g., TCP/IP) or proprietary interprocess communication mechanisms to communicate with each other.
In the embodiment shown, computer system 600 comprises a computer memory (“memory”) 601, a display 602, one or more Central Processing Units (“CPU”) 603, Input/Output devices 604 (e.g., keyboard, mouse, CRT or LCD display, etc.), other computer-readable media 605, and one or more network connections 606. The SAS 610 is shown residing in memory 601. In other embodiments, some portion of the contents, some of, or all of the components of the SAS 610 may be stored on and/or transmitted over the other computer-readable media 605. The components of the SAS 610 preferably execute on one or more CPUs 603 and manage the generation and use of predictive models for annotation, as described herein. Other code or programs 630, a platform 631 for manipulating open data, the open data itself 620, and potentially other data repositories also reside in the memory 601, and preferably execute on one or more CPUs 603. Some of these programs and/or data may be stored in memory of one or more computing systems communicatively attached, such as by network 660. In some embodiments, for example where the SAS operates as a specific machine, the SAS may comprise its own processor 614 (one or more) and an API 617 for access to the various data and models. Of note, one or more of these components may not be present in any particular implementation.
In a typical SAS environment, the SAS 610 includes the components described above, for example, the ATDCS and one or more data repositories (such as data repositories 615 and 616) for storing the models and training data.
In an example embodiment, components/modules of the SAS 610 are implemented using standard programming techniques. For example, the SAS 610 may be implemented as a “native” executable running on the CPU 603, along with one or more static or dynamic libraries. In other embodiments, the SAS 610 may be implemented as instructions processed by a virtual machine. A range of programming languages known in the art may be employed for implementing such example embodiments, including representative implementations of various programming language paradigms, including but not limited to, object-oriented, functional, procedural, scripting, and declarative.
The embodiments described above may also use well-known or proprietary, synchronous or asynchronous client-server computing techniques. Also, the various components may be implemented using more monolithic programming techniques, for example, as an executable running on a single CPU computer system, or alternatively decomposed using a variety of structuring techniques known in the art, including but not limited to, multiprogramming, multithreading, client-server, or peer-to-peer, running on one or more computer systems each having one or more CPUs. Some embodiments may execute concurrently and asynchronously and communicate using message passing techniques. Equivalent synchronous embodiments are also supported.
In addition, programming interfaces to the data stored as part of the SAS 610 (e.g., in the data repositories 615 and 616) including the predictive models can be available by standard mechanisms such as through C, C++, C#, and Java APIs (e.g. API 617); libraries for accessing files, databases, or other data repositories; through scripting languages such as XML; or through Web servers, FTP servers, or other types of servers providing access to stored data. The data repositories 615 and 616 may be implemented as one or more database systems, file systems, or any other technique for storing such information, or any combination of the above, including implementations using distributed computing techniques.
Also, the example SAS 610 may be implemented in a distributed environment comprising multiple, even heterogeneous, computer systems and networks. Different configurations and locations of programs and data are contemplated for use with the techniques described herein. In addition, the server and/or client may be physical or virtual computing systems and may reside on the same physical system. Also, one or more of the modules may themselves be distributed, pooled, or otherwise grouped, such as for load balancing, reliability, or security reasons. A variety of distributed computing techniques are appropriate for implementing the components of the illustrated embodiments in a distributed manner, including but not limited to TCP/IP sockets, RPC, RMI, HTTP, Web Services (XML-RPC, JAX-RPC, SOAP, etc.), and the like. Other variations are possible. Also, other functionality could be provided by each component/module, or existing functionality could be distributed amongst the components/modules in different ways, yet still achieve the functions of an SAS.
Furthermore, in some embodiments, some or all of the components of the SAS 610 may be implemented or provided in other manners, such as at least partially in firmware and/or hardware, including, but not limited to one or more application-specific integrated circuits (ASICs), standard integrated circuits, controllers executing appropriate instructions, and including microcontrollers and/or embedded controllers, field-programmable gate arrays (FPGAs), complex programmable logic devices (CPLDs), and the like. Some or all of the system components and/or data structures may also be stored as contents (e.g., as executable or other machine-readable software instructions or structured data) on a computer-readable medium (e.g., a hard disk; memory; network; other computer-readable medium; or other portable media article to be read by an appropriate drive or via an appropriate connection, such as a DVD or flash memory device) to enable the computer-readable medium to execute or otherwise use or provide the contents to perform at least some of the described techniques. Some or all of the components and/or data structures may be stored on tangible, non-transitory storage mediums. Some or all of the system components and data structures may also be stored as data signals (e.g., by being encoded as part of a carrier wave or included as part of an analog or digital propagated signal) on a variety of computer-readable transmission mediums, which are then transmitted, including across wireless-based and wired/cable-based mediums, and may take a variety of forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). Such computer program products may also take other forms in other embodiments. Accordingly, embodiments of this disclosure may be practiced with other computer system configurations.
All of the above U.S. patents, U.S. patent application publications, U.S. patent applications, foreign patents, foreign patent applications, and non-patent publications referred to in this specification and/or listed in the Application Data Sheet, including but not limited to U.S. Provisional Patent Application No. 62/189,656, entitled “SCALABLE ANNOTATION ARCHITECTURE,” filed Jul. 7, 2015, are incorporated herein by reference, in their entirety.
From the foregoing it will be appreciated that, although specific embodiments have been described herein for purposes of illustration, various modifications may be made without deviating from the spirit and scope of the invention. For example, the methods, systems, and techniques for performing data annotation discussed herein are applicable to other architectures other than a cloud-based architecture. Also, the methods, systems, and techniques discussed herein are applicable to differing application specific protocols, communication media (optical, wireless, cable, etc.) and devices (such as wireless handsets, electronic organizers, personal digital assistants, portable email machines, game machines, pagers, navigation devices such as GPS receivers, etc.).
This application claims the benefit of U.S. Provisional Patent Application No. 62/189,656, entitled “SCALABLE ANNOTATION ARCHITECTURE,” filed Jul. 7, 2015, which is incorporated herein by reference in its entirety.