Not applicable.
The present invention generally relates to the field of crowd sourcing and specifically to identifying specific workers who will provide a most efficient review of crowd sourced materials.
The disclosed invention considers context-heavy data processing tasks that may require many hours of work, and refers to such tasks as macrotasks. Leveraging the infrastructure and worker pools of existing crowd sourcing platforms, the disclosed invention automates macrotask scheduling, evaluation, and pay scales. A key challenge in macrotask-powered work, however, is evaluating the quality of a worker's output, since ground truth is seldom available and redundancy-based quality control schemes are impractical. The disclosed invention, therefore, includes a framework that improves macrotask-powered work quality using a hierarchical review. This framework uses a predictive model of worker quality to select trusted workers to perform review, and a separate predictive model of task quality to decide which tasks to review. Finally, the disclosed invention can identify the ideal trade-off between a single phase of review and multiple phases of review given a constrained review budget in order to maximize overall output quality.
In some embodiments a server assigns section or list item classifications to price list or business data extracted from a website. The server calculates a crowd worker score for each of a plurality of crowd workers based on each worker's quality and speed scores for tasks reviewing the classifications on a worker user interface. If a crowd worker score for a worker is below a crowd worker quality threshold, each new task is routed to the worker, and the received task, when completed, is routed to a worker whose crowd worker score is above the crowd worker quality threshold for review.
In some embodiments a server assigns section or list item classifications to price list or business data extracted from a website. Each new task verifying the classification is routed to a crowd worker, and a completed task is received by the server. The server then calculates a crowd worker score for each of a plurality of crowd workers based on each worker's quality scores according to the worker's review of the classifications on a worker user interface. The server then generates a quality model for predicting a task quality score for the task, according to an error score for the crowd worker. If the error score in the quality model is below a predetermined threshold, the server automatically transmits the completed task to a client computer operated by at least one task reviewer for review.
In some embodiments a server assigns section or list item classifications to price list or business data extracted from a website. The server calculates a crowd worker score for each of a plurality of crowd workers based on each worker's quality and speed scores for tasks reviewing the classifications on a worker user interface. If a crowd worker score for a worker is below a crowd worker quality threshold, each new task is routed to the worker, and the received task, when completed, is routed to a worker whose crowd worker score is above the crowd worker quality threshold for review. The server then identifies a budget for the tasks, and repeats the process for subsequent tasks, transmitting reviewed tasks to a second level task reviewer according to a threshold number of reviewed tasks for second level review, based on the budget.
The above features and advantages of the present invention will be better understood from the following detailed description taken in conjunction with the accompanying drawings.
The present inventions will now be discussed in detail with regard to the attached drawing figures that were briefly described above. In the following description, numerous specific details are set forth illustrating the Applicant's best mode for practicing the invention and enabling one of ordinary skill in the art to make and use the invention. It will be obvious, however, to one skilled in the art that the present invention may be practiced without many of these specific details. In other instances, well-known machines, structures, and method steps have not been described in particular detail in order to avoid unnecessarily obscuring the present invention. Unless otherwise indicated, like parts and method steps are referred to with like reference numerals.
Systems that coordinate human workers to process data make an important trade-off between complexity and scale. As work becomes increasingly complex, it requires more training and coordination of workers. As the amount of work (and therefore the number of workers) scales, the overheads associated with that coordination increase. Worker organization models for task completion have significant implications for the complexity and scale of the work that can be accomplished with those models. Crowd sourcing has recently been used to improve the state of the art in areas of data processing such as entity resolution, structured data extraction, and data cleaning. Human computation is commonly used for both processing raw data and verifying the output of automated algorithms.
Crowd sourced workflows are used in research and industry to solve a variety of tasks. An important concern when assigning work to crowd workers with varying levels of ability and experience is maintaining high-quality work output. Thus, a prominent focus of the crowd sourcing literature has been on quality control: developing workflows and algorithms to reduce errors introduced by workers either unintentionally (due to innocent mistakes) or maliciously (due to collusion or spamming). Three organizational models are compared below: microtask-based decomposition, macrotasks, and traditional freelancer-based knowledge work. Several examples of problems solved at scale with macrotasks are provided.
Most research on quality control in crowd sourced workflows has focused on platforms that define work as microtasks, where workers are asked simple questions that require little context or training to answer. Microtasks are an attractive unit of work, as their small size and low cost make them amenable to quality control by assigning a task to multiple workers and using worker agreement or voting algorithms to surface the correct answer. Microtask research has focused on different ways of controlling this voting process while identifying the reliability of workers through their participation. Such research utilizes microtasks where crowd workers are asked to answer simple yes/no or multiple choice questions with little training.
Unfortunately, not all types of work can be effectively decomposed into microtasks. Microtasks are powerful, but fail in cases where larger context (e.g., domain knowledge) or significant time investment is needed to solve a problem, for example in large-document structured data extraction. Tasks that require global context (e.g., creating papers or presentations) are challenging to programmatically sub-divide into small units. Additionally, voting strategies as a method of quality control break down when applied to tasks with complex outputs, because it is unclear how to perform semantic comparisons between larger and more free-form results.
Thus, an alternative to seeking out good workers on microtask platforms and decomposing their assignments into microtasks is to recruit crowd workers to perform larger and more broadly defined tasks over a longer time horizon. Such a model allows for in-depth training, arbitrarily long-running tasks, and flexible compensation schemes. There has been little work investigating quality control in this setting, as the length, difficulty, and type of work can be highly variable, and defining metrics for quality can be challenging. Traditional freelancer-based knowledge work supports arbitrarily complex tasks, because employers can interact with workers in person to convey intricate requirements and evaluate worker output. This type of work usually involves an employer personally hiring individual contractors to do a fairly large task, such as designing a website or creating a marketing campaign. The work is constrained by hiring throughput and is not amenable to automated quality control techniques, limiting its ability to scale.
Another alternative includes macrotasks. Macrotasks represent a trade-off between microtasks and freelance knowledge work, in that they provide the automation and scale of microtasks, while enabling much of the complexity of traditional knowledge work. In this disclosure, the term macrotask is used to refer to such complex work. This disclosure discusses both the limitations and the opportunities provided by macrotask processing, and then presents a framework that extends existing data processing systems with the ability to use high-quality crowd sourced macrotasks. The disclosed embodiments present the output of automated data processing techniques as the input to macrotasks and instruct crowd workers to eliminate errors. As a result, existing automated systems are easily extended with human workers without requiring the design of custom-decomposed microtasks. Macrotasks, a middle ground between microtasks and freelance work, allow complex work to be processed at scale. Unlike microtasks, macrotasks do not require complex work to be broken down into simpler subtasks: work can be assigned to workers essentially as-is, with the focus instead on providing user interfaces that make workers more effective. Unlike traditional knowledge work, macrotasks retain enough common structure to be specified automatically, processed uniformly in parallel, and improved in quality using automated evaluation of tasks and workers. Much of the complex, large-scale data processing that incorporates human input is amenable to macrotask processing.
The following three non-limiting examples, high-level data-heavy use cases addressed with crowd-powered macrotask workflows at a scale of millions of tasks, demonstrate the utility of macrotasks: 1. Structured Price List Extraction. From yoga studio service lists to restaurant menus, structured data may be extracted from PDFs, HTML, Word documents, Flash animations, and images on millions of small business websites. When possible, this content is automatically extracted, but if automated extraction fails, workers must learn a complex schema and spend upwards of an hour processing the price list data for a business. 2. Business Listings Extraction. ˜30 facts about businesses (e.g., name, phone number, wheelchair accessibility, etc.) are extracted in one macrotask per business. This task could be accomplished using either microtasks or macrotasks, and it is used to help demonstrate the versatility of the disclosed embodiments. 3. Web Design Choices. Crowd workers are asked to identify design elements such as color palettes, business logos, and other visual aspects of a website in order to enable brand-preserving transformations of website templates. These tasks are subjective and do not always have a correct answer: several color palettes might be appropriate for an organization's branding. This makes it especially challenging to judge the quality of a processed task.
The tasks above, with their complex domain-specific semantics, can be difficult to represent as microtasks, but are well-defined enough to benefit from significant automation at scale. Of course, macrotasks come with their own set of challenges, and are less predominant when compared to microtasks. There exist fewer tools for completing unstructured work, and crowd work platforms seldom offer best practices for improving the quality or efficiency of complex work. Tasks can be highly heterogeneous in their structure and output format, which makes the combination of multiple worker responses difficult and automated voting schemes for quality control nearly impossible. Macrotasks also complicate the design of worker pay structures, because payments must vary with task complexity.
To address the issues above, the disclosed embodiments leverage several cost-aware techniques for improving the quality of worker output. These techniques are domain-independent, in that they can be used for any data processing task and crowd work platform that collects and maintains basic data on individual workers and their work history. First, the disclosed embodiments organize the crowd hierarchically to enable trusted workers to review, correct, and improve the output of less experienced workers. Second, the disclosed embodiments provide a predictive model of task error, referred to herein as a TaskGrader, to effectively allocate trusted reviewers to the tasks that need the most correction. Third, the disclosed embodiments track worker quality over time in order to promote the most qualified workers to the top of the hierarchy. Finally, given a fixed review budget, the disclosed embodiments decide whether to allocate reviewer attention to an initial review phase of a task or to a secondary review of previously reviewed tasks in order to maximize overall output quality. Experiments show that generalizable features are more predictive of errors than domain-specific ones, suggesting that the disclosed embodiments' models can be implemented in other settings with little task-type-specific instrumentation. The disclosure provides a non-limiting example evaluation of these techniques on a production structured data extraction system used in industry at scale. For review budget-constrained workflows, this example shows up to 118% improvement over random spot checks when combining TaskGrader with a two-layer review hierarchy, with greater benefits at more constrained budgets.
Put another way, the disclosed embodiments include the following: 1. A framework for managing macrotask-based workflows and improving their output quality given a fixed budget and fixed throughput requirement; 2. A hierarchical review structure that allows expert workers to catch errors and provide feedback to entry-level workers on complex tasks. The disclosed embodiments model workers and promote the ones that efficiently produce the highest-quality work to reviewer status. The examples herein show that 71.8% of tasks with changes from reviewers are improved; 3. A predictive model of task quality that selects tasks likely to have more errors for review; and 4. Empirical non-limiting example results that show that under a constrained budget where not every task can be reviewed multiple times, there exists an optimal trade-off between one-level and two-level review that catches up to 118% more errors than random spot checks.
The described embodiments may include one or more computing machines (including one or more server computers and one or more client computers), and one or more databases communicatively coupled through a network. The server and client may include at least one processor executing instructions within a communicatively coupled memory, the instructions causing the computing machines to execute the method steps disclosed herein. The server may store, within a database coupled to the network, a plurality of data, possibly organized into data records and data tables.
A task requester may access a task framework user interface (UI) on a client computer, in order to create a request (“framework”) for multiple macrotasks (e.g., tasks for identifying and classifying, within website content, menu sections, menu items, prices, and specific context sensitive items, such as adding chicken $4, shrimp $7, or salmon $8 to salad). The requester may input multiple parameters defining the task framework including, for example: a budget and/or throughput requirement; multiple URIs or electronic documents containing task-related content to be crawled in association with the task framework; customized parameters within an API defining a generic schema including grammars used to identify context clues (e.g., HTML tags/attributes, XML tags/attributes, fonts, color schemes, style sheets, etc.) and classify groupings of content (e.g., menu item, menu price, menu section, etc.) within a web page at the URI or within the electronic documents as received, according to the schema; and customized definitions for UI controls, to be accessed by crowd workers in order to verify that classifications assigned to the task content are correct. The user then submits all task framework data to one or more servers, which receives the data and stores it within the database.
In response to receiving the task framework data, the server automatically executes a crawl of the content for each of the designated URIs or other electronic documents, classifies the content according to the context clues defined within the content schema, and stores the content classifications (representing the server's best guess of the content classification) as data records in the database, in association with the task framework, and possibly the crawled URI. The server then renders and transmits, for display on a crowd worker client machine, a UI display allowing crowd workers to verify and/or correct the classifications of the crawled content. In some embodiments, the UI display may include a rendering of the content within a browser as displayed in the web page at the URI or within the electronic document. The UI display may also include an editable display of the data records representing the content as automatically classified by the server.
More experienced crowd workers may train new (or less experienced) crowd workers in analyzing the server's classification for each task (i.e., each URI or electronic document displayed in the crowd worker UI) to determine if the server's automatic classification for the content is correct. The crowd worker being trained may compare the assigned classifications with the content displayed in the browser, and correct any necessary content classifications by inputting the corrections within the editable display. The crowd worker may submit the task when complete. After decoding the transmission of the submitted task, the server may determine the total amount of content modified by the new crowd worker (e.g., number of lines changed, or percent of content changed compared to the total content). The server may then store the amount of content modified, in association with the designated task, within the database. The server may also determine the task speed (e.g., the time it took the worker to complete the task, possibly the amount of time between the crowd worker receiving the task and submitting it to the server) and store this data, in association with the task, in the database.
Initially, the more experienced crowd worker, or other reviewer, may review each task submitted by the new or less experienced crowd worker, and may identify and correct any errors in the submitted task (possibly using a crowd worker UI designed to review tasks). The reviewer may then submit the review, and the server again determines the amount/percentage of content modified (between the original or previous submission and the review), as well as the task speed for the review, and stores the percentage of modified content and task speed in the database in association with the task. This review process may be repeated as many times as necessary to bring the task's quality rate above a threshold determined by the task framework budget.
As tasks are completed by each crowd worker, the server may calculate a score for the crowd worker who submitted the tasks, based on the quality and the speed with which the crowd worker completed the task. The quality of the task may be calculated as the inverse of the percentage of content modified in reviews of the task. Thus, if a task was reviewed, and 5% of the content was modified by the reviewer (presumably because it was incorrect), the crowd worker would have a 95% quality score for that task (possibly calculated as a decimal, 0.95). The server may analyze the quality scores for all of the crowd worker's tasks at a 75th percentile error rate (associated in the database with the task framework) to calculate an overall quality score for that crowd worker for that request.
This quality scoring process may be repeated for all crowd workers associated in the database with the request, and in some embodiments, the range of quality scores may be normalized, so that the highest quality score is a 1, and the lowest quality score is a 0. The server may then re-calculate each crowd worker's quality score relative to these normalized scores.
Similarly, the server's calculation of the speed element of each crowd worker's score may be a function of selecting the task speed data for all tasks associated with the task framework, and normalizing the highest task speed to 1, and the lowest task speed to 0. The server may then calculate each crowd worker's score relative to these normalized scores, possibly as a decimal representation of the average task speed for that crowd worker, as a percentage of the normalized fastest or slowest score.
The server may then calculate each crowd worker's total quality score as a weighted average between the crowd worker's task quality score and task speed score. Each crowd worker's score may be re-calculated relative to all crowd workers' scores associated with that request each time a submitted task associated in the database with that crowd worker is reviewed.
The server may organize all crowd workers trained for tasks within a specific task framework into a hierarchy of crowd workers by generating a total score for the crowd workers, and ranking them according to their total score. The server may then select the data record defining the budget and any throughput requirements for the task framework and calculate the number of tasks, the percentage of completed tasks to review, and the percentage of completed tasks needing a second or subsequent review according to the budget and throughput requirements.
According to these calculations, the server may determine a percentage of the crowd workers for the specific task framework to be designated as data entry specialists (DES), first level reviewers, and second level reviewers needed, and may organize this hierarchy according to the crowd worker rank determined above. As additional tasks are reviewed, and the server re-calculates the scores and ranks for the most recently reviewed tasks, the server may dynamically update the hierarchy to re-designate crowd workers to new levels within the hierarchy, according to the budget and throughput requirements.
For each new completed task submitted by DES workers within the hierarchy, the server may identify the crowd worker identifier associated with the completed task, and identify that crowd worker's quality score (i.e., the normalized inverse of the average percentage of content corrected in that worker's most recent reviewed tasks, at the 75th percentile error rate). Based on this quality score, the server may calculate a predictive error rate/quality score for the most recently received completed task. The server may then compare this score with a threshold error rate, determined by the budget and/or throughput parameters, and if the quality score is below this threshold, the completed task may be flagged for review. All tasks flagged for review may be automatically forwarded by the server to a reviewer for review. This process may be repeated for subsequent levels of review until the predicted quality score no longer falls below the threshold.
Turning now to
The described embodiments may include one or more computing machines (including one or more server computers and one or more client computers 115) and one or more databases communicatively coupled through a network. The server and client 115 may include at least one processor executing instructions within a communicatively coupled memory, the instructions causing the computing machines to execute the method steps disclosed herein. The server may store, within a database, a plurality of data, possibly organized into data records and data tables.
As non-limiting examples, the processor on the server may execute instructions including one or more software modules, such as one or more task manager software modules 100, one or more task grader software modules 105, one or more worker manager software modules 110, one or more worker model software modules 120, and/or one or more task router software modules 125. The data received from the client computer 115 and/or from calculations run by the disclosed software modules may be stored by the server in the database and decoded and executed by the processor within memory according to the software instructions within the disclosed software modules to complete the method steps disclosed herein.
This section provides an overview of a task framework that combines automated models with complex crowd tasks. This task framework is a scheme for quality control in macrotasks that can generalize across many applications in the presence of heterogeneous task outputs. This task framework may be used for performing several data processing tasks, but structured data extraction will be used as a running example. To reduce error introduced by crowd workers while remaining domain-independent, the task framework uses three complementary techniques that are described next: a review hierarchy, predictive task modeling, and worker modeling. These techniques are effective when dealing with tasks that are complex and highly context-sensitive, but still have structured output.
Turning now to
A task requester may create a task framework defining the details of the tasks to be distributed among the hierarchy of crowd workers. The task requester may access a task framework UI, displayed on a client computer 115, in order to define the task framework for the tasks that the task requester is requesting. This task framework may define: multiple macrotasks the requester wants performed; a classification schema defining parameters that the server computer uses to automatically extract and assign classifications to the content; the designated documents (e.g., crawled web pages, uploaded price lists) to which the classification schema and extractors apply; and/or definitions of UI elements to be displayed to crowd workers as they determine if the classifications assigned to the content by the automatic extractors are correct.
The task requester may also input budget and/or fixed throughput information in association with the requested task framework. The server may store, within the database, task framework data input by the requester or other user. In some embodiments, each task framework data may be stored within its own data record, in a data table storing task framework information, such as the example data table below.
Each data record in this example data table may include: a task framework id data field storing a unique id associated with task framework; a task framework name data field naming or describing the task framework; a data field storing the number of tasks to be completed; and a budget data field storing the budget for the requested task framework.
In the example data table above, the server may receive the task framework data, and automatically generate and store the data record with a task framework id 1, with a task framework name “Menu Classification,” a number of tasks set at 1000, and a budget of $25,000. This example task framework data table also includes an additional data record subsequently received by the server. Though beyond the scope of the disclosed embodiments, additional data tables and data records may also store task framework details relating to the content extraction and classification schemas and crowd worker UI controls, described below.
The task requester may access, possibly via the task framework UI, an API defining a generic task framework for macrotasks that the task requester may want to request. In the case of the non-limiting price list extraction task example, the generic framework may include a content schema and a collection of generic parameters including machine learned classifiers stored within the database and used to identify potential menu sections, menu item names, prices, descriptions, and item choices and additions (e.g., identifying and classifying, within a restaurant website content, menu sections, menu items, prices, and specific context sensitive items, such as adding chicken $4, shrimp $7, or salmon $8 to salad).
These machine-learned classifiers may define the parameters which the server computer uses to execute software that acts as automated extractors (explained in more detail below), in order to analyze, classify and extract content while crawling designated websites or receiving uploaded price lists, for example. These parameters may include generic parameters for grammars within the schema used to define context clues (e.g., HTML tags/attributes, XML tags/attributes, fonts, color schemes, cascading style sheets, etc.) used to identify and/or classify content within a web page, website, and/or received price list (e.g., menu item, menu price, menu section, etc.).
The requester, using the framework UI, may further customize the content schema for the generic task framework according to user-specific input modifying or adding to the parameters of the generic framework. These additional parameters may include one or more new macrotask types. To define a new macrotask type, a developer using the disclosed embodiments provides task data. Users must implement a method that provides task-specific data encoded as JSON for each task. Such data might be serialized in various ways. For example, business listings tasks produce a key-value mapping of business attributes (e.g., phone numbers, addresses). For price lists, a markup language allows workers to edit blocks of text and label them (e.g., sections, menu items).
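As a non-limiting illustration, the following sketch shows what such a task-data method might look like. The Task class, field names, and label set are hypothetical placeholders introduced for illustration only, not elements defined by the disclosure.

```python
import json
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical task record; real task types and payloads are user-defined.
    type: str
    payload: dict


def task_to_json(task: Task) -> str:
    """Encode task-specific data as JSON for the worker UI (illustrative only)."""
    if task.type == "business_listing":
        # Business listings: a key-value mapping of business attributes.
        return json.dumps(task.payload)
    if task.type == "price_list":
        # Price lists: markup text that workers edit and label.
        return json.dumps({"markup": task.payload.get("markup", ""),
                           "labels": ["section", "item", "price"]})
    raise ValueError(f"unknown macrotask type: {task.type}")
```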
The requestor may also provide the technical parameters for a method within one or more worker interface renderer software modules running on the server. The technical parameters for these methods may include customized definitions for the UI controls for the worker interface, used by the worker to verify that the extractors' classifications of the website content or uploaded price lists are correct. Users adding a new macrotask type to the disclosed framework need not write any backend code to manage tasks or workers. They simply build the user interface for the task workflow and wire it up to the framework's API.
Other interface features (e.g., a commenting interface for workers to converse, buttons to accept/reject a task) are common across different task types and provided by the disclosed embodiments.
The requester may also provide one or more error metrics. Given two versions of task data (e.g., an initial and a reviewed version), an error metric helps the TaskGrader, described below, determine how much that task has changed. For textual data, this metric might be based on the number of lines changed, whereas more complex metrics are required for media such as images or video. Users can pick from the disclosed embodiments' pre-implemented error metrics or provide one of their own.
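As a non-limiting example, a simple line-based error metric for textual task data might be sketched as follows; the use of Python's difflib here is an illustrative assumption, and any diff procedure could be substituted.

```python
import difflib


def line_error_metric(initial: str, reviewed: str) -> float:
    """Approximate the fraction of output lines changed between two task versions."""
    before = initial.splitlines()
    after = reviewed.splitlines()
    if not before and not after:
        return 0.0
    matcher = difflib.SequenceMatcher(None, before, after)
    changed = sum(max(i2 - i1, j2 - j1)
                  for tag, i1, i2, j1, j2 in matcher.get_opcodes()
                  if tag != "equal")
    return min(1.0, changed / max(len(before), len(after)))
```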
The task requester may also designate a collection of one or more URIs or data sources identifying the web pages/websites to be crawled, and/or one or more data sources for the uploaded or received price lists, in association with the tasks to be completed for the requested task framework. The user then submits the task framework/request data to one or more servers, which receives the data and stores it within the database.
In response to receiving the task request data, the server may automatically execute a crawl of the content for each of the designated URIs, and/or analyze the price list data uploaded from the designated data source(s).
The server may run the software modules implementing the automated extractors, in order to classify the content of each URI and/or uploaded price list making up a task, according to the machine learned classifiers, using the context clues defined within the content schema. For example, automated extractors (e.g., optical character recognition, flash decompilation), and machine learned classifiers 305 may identify potential menu sections, menu item names, prices, descriptions, and item choices and additions. Using the automated extractor software 305, the server may store the content classifications (representing the server's best guess of the content classification) as data records in the database, in association with the crawled URI or price list identifying the task framework.
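As a purely illustrative sketch of machine-learned line classification, and not the specific extractors or classifiers 305 of the disclosed embodiments, a simple classifier could be trained and applied as follows; the toy training lines, label set, and model choice are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data standing in for previously labeled price list lines.
training_lines = ["BRUNCH", "anis eggs benedict", "$12",
                  "DINNER", "grilled salmon", "$24"]
training_labels = ["section", "item", "price",
                   "section", "item", "price"]

# Character n-grams serve as crude context clues for classifying each line.
classifier = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000))
classifier.fit(training_lines, training_labels)

# Best-guess classifications for newly crawled content, stored for later review.
print(classifier.predict(["poached eggs on toasted brioche", "LUNCH"]))
```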
The server may store, within the database, extracted task data generated as the server runs the content extractor software modules. In some embodiments, each extracted task data may be stored within its own data record, in a data table storing extracted task information, such as the example data table below.
Each data record in this example data table may include: an extracted task id data field storing a unique id associated with the extracted task; a task framework id data field associating the extracted task with a task framework; a menu id data field associating the extracted task with a menu (e.g., “Brunch”, not shown), an extracted item data field naming the extracted menu item; a description data field describing the extracted menu item; and a price data field storing a price for the extracted menu item.
In the example data table above, the server may run the content extractor software, and automatically generate and store the data record with an extracted task id of 1, a task framework id of 1, a menu id of 1 (“Brunch”), an item name of anis eggs benedict, a description of Poached eggs on toasted brioche, with black forest ham, hollandaise and Lyonnaise potatoes, and a price of $12. This example extracted task data table also includes an additional data record subsequently received by the server.
The resulting crowd-structured data is used to periodically retrain classifiers to improve their accuracy. The macrotask model provides for lower latency and more flexibility in throughput when compared to a freelancer model. One requirement for the use of these price list extraction tasks is the ability to handle bursts and lulls in demand. Additionally, for some tasks, very short processing times may be required. These constraints make a freelancer model, with slower on-boarding practices, less well suited to this example problem than macrotasks.
Microtasks are also a bad fit for this price list extraction task. The tasks are complex, as workers must learn the markup format and hierarchical data schema to complete tasks, often taking 1-2 weeks to reach proficiency. Using a microtask model to complete the work would require decomposing it into pieces at a finer granularity than an individual menu. Unfortunately, the task is not easily decomposed into microtasks because of the hierarchical data schema: for example, menus contain sections which contain subsections and/or items, and prices are frequently specified not only for items, but for entire subsections or sections. There would be a high worker coordination cost if such nested information were divided across several microtasks. In addition, because raw menu text appears in a number of unstructured formats, deciding how to segment the text into items or sections for microtask decomposition would be a challenging problem in its own right, requiring machine learning or additional crowdsourcing steps. Even if microtask decomposition were successful, traditional voting-based quality control schemes would present challenges, as the free-form text in the output format can vary (e.g., punctuation, capitalization, missing/additional articles) and the schema requirements are loose. Most importantly, while it might be possible in some situations to generate hundreds of microtasks for each of the hundreds of menu items in a menu, empirical estimates based on business process data suggest that the cost of a single worker completing the complex version of these tasks is significantly lower than the cost of redundantly completing the many microtasks it would take to process most menus.
In the following sections, the system designed for implementing the price lists task and other macrotask workflows will be described, focusing specifically on the challenges of improving work quality in complex tasks.
Turning now to
As seen in
Turning now to
The server may store, within the database, crowd worker data input by a system administrator or other user. In some embodiments, each crowd worker may be stored within its own data record, in a data table storing crowd worker data, such as the example data table below.
Each data record in this example data table may include: a crowd worker id data field storing a unique id associated with each crowd worker; a task framework id data field referencing a data record within the task framework data table and identifying a task framework associated with the crowd worker id; a first name data field storing the first name of the crowd worker; and a last name data field storing the last name of the crowd worker.
In the example data table above, the server may receive the crowd worker data, and automatically generate and store the data record with a crowd worker id 1, with a first name “John,” and with a last name “Doe.” This example crowd worker data table also includes an additional data record subsequently received by the server.
The crowd worker being trained may examine the content created by the content extractors, compare it with the content displayed in the browser, and correct any necessary content classifications by inputting the corrections within the editable display. As noted above,
After decoding the transmission of the submitted task, the server may determine the total amount of content modified by the DES (e.g., number of lines changed, or percent of content changed compared to the total content). The server may then store the amount of content modified, in association with the designated task, within the database.
The server may also determine the task speed (e.g., the time it took the worker to complete the task, possibly the amount of time between the crowd worker receiving/beginning the task and submitting it to the server) and store this data associated with the task and the crowd worker in the database.
High quality is achieved through review, corrections, and recommendations of educational content to entry-level workers. Initially, the more experienced crowd worker, or another reviewer, may therefore review each task submitted by the new or less experienced crowd worker (possibly using a crowd worker UI designed to review tasks, not shown, but possibly similar to the review UI shown in
The server may receive the review submission and analyze the submission to determine the amount/percentage of content modified from the original task submission (or any previous review submission), as well as the task speed for the review, and store the amount/percentage of modified content and task speed in the database in association with the task. This review process may be repeated as many times as necessary to bring the task's quality rate above a threshold determined by the request budget (described in more detail below).
As tasks are completed by each crowd worker, the server may calculate a score for each task submitted by each crowd worker, based on the quality and the speed with which the crowd worker completed the task. A key aspect of the disclosed embodiments is the ability to identify skilled workers to promote to reviewer status. In order to identify which crowd workers to promote near the top of the hierarchy (described below), a metric may be developed by which all workers are ranked, composed of two components. The first component is work quality. The quality of the task may be calculated as the inverse of the percentage of content modified in reviews of the task. Thus, if a task was reviewed, and 5% of the content was modified by the reviewer (presumably because it was incorrect), the crowd worker would have a 95% quality score for that task (possibly stored as a decimal, 0.95).
Given all of the tasks a worker has completed recently, the worker's error score may be taken as their 75th percentile worst score. It is shown below that worker error percentiles around 80% are the most important worker-specific feature for determining the quality of a task. The server may store, within the database, crowd worker task quality score data calculated by the server. In some embodiments, each crowd worker task quality score may be stored within its own data record, in a data table storing task quality, such as the example data table below.
Each data record in this example data table may include: a task quality score id data field storing a unique id associated with each crowd worker task quality score; a worker id data field referencing a data record within the crowd worker data table and identifying a crowd worker associated with the crowd worker task quality score; a task framework id data field referencing a data record within the task framework data table and identifying a task framework associated with the crowd worker quality score; a task id referencing the task for which the crowd worker task quality score was calculated; and a quality score data field storing the calculated (and possibly normalized) quality score for that task.
In the example data table above, the server 110 may calculate the quality score for each received task, and automatically generate and store the data record with a quality score id 1, referencing crowd worker 1 (John Doe), framework 1 (Menu price list), task 1 (anis eggs benedict), and a quality score for task 1 of 0.25 (e.g., 75% of the content changed after review). This example crowd worker data table also includes additional data records subsequently received by the server.
The second component of the ranking metric is work speed. How long each worker takes to complete tasks on average may be measured. The server's calculation of the speed element of each crowd worker's score may be a function of selecting the task speed data for all tasks associated in the database with an identification for the task framework, and normalizing the highest task speed (e.g., the fewest number of minutes between receipt and completion of a task) to 1, and the lowest task speed (e.g., the greatest number of minutes between receipt and completion of a task) to 0. The server may then calculate each crowd worker's score relative to these normalized scores, possibly as a decimal representation of the average task speed for that crowd worker, as a percentage of the normalized fastest or slowest score.
The server may store, within the database, crowd worker speed score data calculated by the server. In some embodiments, each crowd worker speed score may be stored within its own data record, in a data table storing task speed, such as the example data table below.
Each data record in this example data table may include: a speed score id data field storing a unique id associated with each crowd worker speed score; a worker id data field referencing a data record within the crowd worker data table and identifying a crowd worker associated with the crowd worker speed score; a task framework id data field referencing a data record within the task framework data table and identifying a task framework associated with the crowd worker speed score; a task id referencing the task for which the crowd worker speed score was calculated; a time data field storing the time it took to complete the task (e.g., 5 minutes); and a speed score data field storing the calculated (and possibly normalized) speed score for that task.
In the example data table above, the server may calculate the speed score for each received task, and automatically generate and store the data record with a speed score id 1, referencing crowd worker 1 (John Doe), framework 1 (Menu price list), task 1 (anis eggs benedict), and a speed score for task 1 of 0.9 (e.g., 90% of the fastest speed score, which was normalized to 1). This example crowd worker speed score data table also includes additional data records subsequently received by the server.
This quality scoring process may be repeated for all crowd workers associated in the database with the framework defining the framework-related tasks. All workers may be sorted by their 75th percentile error score, and each worker may be assigned a score from 0 (worst) to 1 (best) based on this ranking. All workers may be ranked by how quickly they complete tasks, assigning workers a score from 0 (worst) to 1 (best) based on this ranking. Thus, in some embodiments, the range of quality scores may be normalized, so that the highest quality score is a 1, and the lowest quality score is a 0. The server may then re-calculate each crowd worker's quality score relative to these normalized scores.
A weighted average of these two metrics may be taken as a worker quality measure. The server may calculate each crowd worker's total score as a weighted average between the crowd worker's quality score and speed score. Each crowd worker's score may be re-calculated relative to all crowd workers' scores associated with that task framework each time a submitted task associated in the database with that crowd worker is reviewed. With this overall score for each worker, workers may be promoted, demoted, provided bonuses, or contracts may be ended, depending on overall task availability.
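A minimal sketch of this ranking computation, assuming per-task error fractions and completion times are already stored for each worker, is shown below; the equal 0.5 weighting and the dictionary-based inputs are illustrative assumptions rather than values prescribed by the disclosure.

```python
import numpy as np


def worker_total_scores(task_errors, task_minutes, quality_weight=0.5):
    """Rank workers by a weighted average of normalized quality and speed scores.

    task_errors / task_minutes: dicts mapping a worker id to that worker's
    per-task error fractions and completion times, respectively.
    """
    # Quality component: 75th-percentile worst error per worker.
    error_75 = {w: np.percentile(errs, 75) for w, errs in task_errors.items()}
    # Speed component: average completion time per worker.
    avg_minutes = {w: float(np.mean(mins)) for w, mins in task_minutes.items()}

    def normalize_inverted(values):
        # Map the best (lowest) value to 1 and the worst (highest) value to 0.
        lo, hi = min(values.values()), max(values.values())
        span = (hi - lo) or 1.0
        return {w: (hi - v) / span for w, v in values.items()}

    quality = normalize_inverted(error_75)   # lowest error  -> 1
    speed = normalize_inverted(avg_minutes)  # fastest tasks -> 1
    return {w: quality_weight * quality[w] + (1 - quality_weight) * speed[w]
            for w in task_errors}
```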
The server may store, within the database, crowd worker quality score data calculated by the server. In some embodiments, each crowd worker quality score may be stored within its own data record, in a data table storing crowd worker quality scores, such as the example data table below.
Each data record in this example data table may include: a crowd worker quality score id storing a unique id associated with the crowd worker quality score; a crowd worker id data field referencing a data record within the crowd worker data table and identifying a crowd worker associated with the crowd worker quality score id; a task framework id data field referencing a data record within the task framework data table and identifying a task framework associated with the crowd worker id; a quality score data field storing the crowd worker's normalized quality score; a speed score data field storing the crowd worker's normalized speed score; and a total score data field storing the crowd worker's normalized total score based on the weighted average between the quality score and the speed score.
In the example data table above, the server may calculate the quality, speed, and total scores for each crowd worker, and automatically generate and store the data record with a crowd worker quality score id 1, referencing crowd worker 1 (John Doe), framework 1 (Menu price list), and storing a quality score of 0.25, a speed score of 0.9, and a total score of 0.7. This example crowd worker data table also includes additional data records subsequently received by the server.
To achieve high task quality, the disclosed embodiments identify a crowd of trusted workers and organize them in a hierarchy with the most trusted workers at the top. The server may therefore update the data records for all crowd workers, trained for tasks for a specific task framework, into a hierarchy of crowd workers by generating a total score for the crowd workers according to the method steps above, and ranking them according to their total normalized score.
The review hierarchy is depicted in
Because per-task feedback only provides one facet of worker training and development, the disclosed embodiments may rely on a crowd Manager to develop workers more qualitatively. This Manager is manually selected from the highest quality Reviewers, and handles administrative tasks while fielding questions from other crowd workers. The Manager also looks for systemic misunderstandings that a worker has, and sends personalized emails suggesting improvements and further reading. Workers receive such a feedback email at least once per month. In reviewing workers, the Manager also recommends workers for promotion/demotion, and this feedback contributes to hierarchy changes. If the Manager spots an issue that is common to several workers, the Manager might generate a new training document to supplement workers' education. Although the crowd hierarchy is in this way self-managing, the process of on-boarding users and ending contracts is not left to the Manager: it requires manual intervention by the framework user.
As additional tasks are reviewed, and the server re-calculates the scores and ranks for the most recently reviewed tasks, the server may dynamically update the hierarchy to reassign crowd workers to new levels within the hierarchy, possibly limited by the task framework's fixed throughput and budget, discussed above. Workers are therefore incentivized to complete work quickly and at a high level of quality. A worker's speed and quality rankings are described in more detail above, but in short, workers are ranked by how poorly they performed in their middling-to-worst tasks, and by how quickly they completed tasks relative to other workers. Given this ranking, workers are automatically promoted or demoted by the server appropriately on a regular basis.
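For illustration only, the following sketch shows one way such rank-based re-designation could be implemented; the reviewer fractions roughly mirror the example crowd composition described later in this disclosure and are assumptions, not required values.

```python
def assign_hierarchy(total_scores, top_reviewer_frac=0.10, reviewer_frac=0.12):
    """Re-designate workers to hierarchy levels by their total score rank."""
    ranked = sorted(total_scores, key=total_scores.get, reverse=True)
    n = len(ranked)
    n_top = max(1, round(top_reviewer_frac * n))
    n_rev = max(1, round(reviewer_frac * n))
    levels = {}
    for i, worker in enumerate(ranked):
        if i < n_top:
            levels[worker] = "top_reviewer"
        elif i < n_top + n_rev:
            levels[worker] = "reviewer"
        else:
            levels[worker] = "data_entry_specialist"
    return levels
```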
Reviewers are paid an hourly wage, while DES are paid a fixed rate based on the difficulty of their task, which can be determined after a reviewer ensures that they have done their work correctly. This payment mechanism incentivizes Reviewers to take the time they need to give workers meaningful feedback, while DES are incentivized to complete their tasks at high quality as quickly as possible. Based on typical work speed of a DES, Reviewers receive a higher hourly wage. The Manager role is also paid hourly, and earns the highest amount of all of the crowd workers. As a further incentive to do good work quickly, workers are rate-limited per week based on their quality and speed over the past 28 days. For example, the top 10% of workers are allowed to work 45 hours per week, the next 25% are allowed 35 hours, and so on, with the worst workers limited to 10 hours.
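A minimal sketch of such rate limiting is shown below; the top two tiers restate the example percentages above, while the middle tier value is an assumption added only to complete the illustration.

```python
def weekly_hour_limit(rank_percentile: float) -> int:
    """Illustrative weekly hour cap based on a worker's 28-day quality/speed rank.

    rank_percentile ranges from 0.0 (worst worker) to 1.0 (best worker).
    """
    if rank_percentile >= 0.90:      # top 10% of workers
        return 45
    if rank_percentile >= 0.65:      # next 25% of workers
        return 35
    if rank_percentile >= 0.25:      # middle tiers (assumed value)
        return 20
    return 10                        # worst workers
```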
For each new completed task submitted by DES workers within the hierarchy, the server may identify the crowd worker identifier associated in the database with the crowd worker that submitted the completed task, and identify that crowd worker's quality score (i.e., the normalized inverse of the average percentage of content corrected in that worker's most recently reviewed tasks, as determined at the worker's 75% error rate).
A predictive model, referred to as TaskGrader herein, decides which tasks to review. TaskGrader leverages, from the crowd worker identified in association with the submitted completed task, available worker context, work history, and past reviews to train a regression model that predicts an error score used to decide which tasks are reviewed. The goal of the TaskGrader is to maximize quality, which is measured as the number of errors caught in review of the crowd worker's submitted completed tasks, as reflected in the selected data records associated with the worker's previously completed tasks.
The server may predict the quality score of the submitted and completed task according to an error metric. Given two versions of task data within one or more data records of the crowd worker associated with the most recently submitted completed tasks (e.g., an initial and a reviewed version), an error metric helps the TaskGrader, described herein, to determine how much that task has changed. For textual data, this metric might be based on the number of lines changed, whereas more complex metrics are required for media such as images or video. As noted in regard to the requester described above, users can pick from the disclosed embodiments' pre-implemented error metrics or provide one of their own.
In order to generate ground-truth training data for a supervised regression model, past data from the hierarchical review model may be leveraged. The fraction of output lines of a task that are incorrect, as stored in the data records associated in the database with the crowd worker who submitted the most recently completed tasks, may be used as an error metric. This value may be approximated by measuring the lines changed by a subsequent reviewer of a task, as stored in the data records associated in the database with the crowd worker who submitted the most recently completed tasks. Training labels may be computed by measuring the difference between the output of a task in these data records before and after review. Thus, tasks that have been reviewed in the hierarchy are usable as labeled examples for training the model.
An online algorithm may be used for selecting tasks to review, because new tasks continuously arrive on the system. This online algorithm frames the problem as a regression: the TaskGrader predicts the amount of error in a task, and a review threshold is dynamically set at runtime so that the tasks with the highest predicted error are reviewed without overrunning the available budget. If a static pool of tasks were assumed, the problem might better be expressed as a ranking task.
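As a non-limiting sketch, such a regression model could be trained on past review diffs as follows; the gradient-boosted regressor and the feature layout are assumptions made for illustration, since the disclosure requires only some supervised regression model.

```python
from sklearn.ensemble import GradientBoostingRegressor


def train_task_grader(feature_rows, review_error_labels):
    """Fit a regression model mapping worker/task features to observed review error.

    feature_rows: one row per previously reviewed task, e.g.
        [worker_75th_pct_error, worker_avg_speed, task_length, ...]
    review_error_labels: fraction of output lines changed by the reviewer.
    """
    model = GradientBoostingRegressor()
    model.fit(feature_rows, review_error_labels)
    return model


def predict_task_error(model, feature_row):
    """Predicted error score for a newly completed, not-yet-reviewed task."""
    return float(model.predict([feature_row])[0])
```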
The server may then identify the budget submitted by the requester of the task framework to determine if the predicted quality score for the task falls within the range of scores determined by the budget to be in need of review. To ensure a consistent review budget (e.g., 40% of tasks should be reviewed), a threshold must be picked for the TaskGrader regression in order to spend the desired budget on review. Depending on periodic differences in worker performance and task difficulty, this threshold can change. Every few hours, the TaskGrader score distribution for the past several thousand tasks may be loaded and the TaskGrader review threshold empirically set so that the threshold would have identified the desired number of tasks for review. In practice, this procedure results in accurate TaskGrader-initiated task review rates. This process may be repeated for subsequent levels of review until the predicted quality score no longer falls within the range of scores determined by the budget to be in need of review.
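A minimal sketch of this periodic threshold calibration, assuming the recent TaskGrader scores are available as a list, might look like the following.

```python
import numpy as np


def calibrate_review_threshold(recent_scores, review_fraction=0.40):
    """Pick the predicted-error threshold that would have flagged roughly
    `review_fraction` of the last several thousand tasks for review."""
    return float(np.percentile(recent_scores, 100 * (1 - review_fraction)))

# Recomputed every few hours; any new task whose predicted error meets or
# exceeds the returned threshold is then routed to a reviewer.
```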
The space of possible implementations of TaskGrader spans three objectives. The first objective is throughput, which is the total number of tasks processed. For the design of TaskGrader, throughput is held constant and the initial processing of each task is viewed as a fixed cost. The second objective is cost, which is the amount of human effort spent by the system, measured in task counts. This cost is held constant at an average of 1.56 workers per task (a parameter which should be set based on available budget and throughput requirements). The TaskGrader can allocate either 1, 2, or 3 workers per task, subject to the constraint that the average is 1.56. The third objective is quality, which is the inverse of the number of errors per task. Quality is difficult to measure in absolute terms, but can be viewed as the steady state one would reach by applying an infinite number of workers per task. Quality is approximated by the number of changes (which are assumed to be errors fixed) made by each reviewer. The goal of the TaskGrader is to maximize the number of errors fixed across all reviewed tasks.
Care should be taken with the tasks picked for future TaskGrader training. Because tasks selected for review by the TaskGrader are biased toward high error scores, they cannot be used to train future TaskGrader models without bias. A fraction of the overall review budget may be reserved to randomly select tasks for review, and future TaskGrader models may be trained on only this data. For example, if 30% of tasks are reviewed, the aim should be to have the TaskGrader select the worst 25% of tasks, and select another 5% of tasks for review randomly, only using that last 5% of tasks to train future models.
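For illustration, the split between TaskGrader-selected and randomly selected reviews could be implemented as sketched below; the 30%/5% split mirrors the example above, and the task identifiers are assumed to be sortable keys.

```python
import random


def select_reviews(task_ids, predicted_errors, review_budget=0.30, random_share=0.05):
    """Split the review budget between TaskGrader-selected and random tasks.

    Only the randomly selected tasks should later be used as training data,
    since TaskGrader-selected tasks are biased toward high predicted error.
    """
    n = len(task_ids)
    random_picks = set(random.sample(task_ids, int(random_share * n)))
    remaining = [t for t in task_ids if t not in random_picks]
    n_grader = int((review_budget - random_share) * n)
    grader_picks = sorted(remaining, key=lambda t: predicted_errors[t],
                          reverse=True)[:n_grader]
    return grader_picks, sorted(random_picks)
```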
Occasionally users of the system may need to apply domain-specific tweaks to the error score. The task error score may be presented as the fraction of the output lines found incorrect in review. In its pure form, the score should lend itself reasonably well to various text-based complex work. However, one must be careful that the error score is truly representative of high or low quality. In this scenario, workers can apply comments throughout a price list's text to explain themselves without modifying the displayed price list content (e.g., “# I couldn't find a menu on this website, leaving task empty”). Reviewers sometimes changed the comments for readability, causing the comments to appear as line differences, thus affecting the error score. These comments are not relevant to the output, so workers may have been penalized for differences that were not important. For near-empty price lists, this had an especially strong effect on the error score and skewed the results. When the system was modified to remove comments prior to computing the error score, the accuracy rose by nearly 5%.
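A minimal sketch of such a tweak, assuming comments are marked with a leading '#' as in the example above, is shown below; it would be applied to both task versions before the error metric is computed.

```python
def strip_worker_comments(task_text: str) -> str:
    """Drop worker comment lines so that comment edits do not count as errors."""
    return "\n".join(line for line in task_text.splitlines()
                     if not line.lstrip().startswith("#"))

# Usage (with the illustrative line_error_metric sketched earlier):
# error = line_error_metric(strip_worker_comments(initial),
#                           strip_worker_comments(reviewed))
```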
The system may then apply machine learning. For example, as noted above, machine-learned classifiers identify potential menu sections, menu item names, prices, descriptions, and item choices and additions. If automated extraction works perfectly, the crowd worker's task is simple: mark the task as being in good condition. If automated extraction fails, a crowd worker might spend hours manually typing all of the contents of a hard-to-extract menu. The resulting crowd-structured data is used to periodically retrain the classifiers to improve their accuracy.
A structured data extraction workflow was described above. Since macrotasks power its crowd component, and because the automated extraction and classifiers do not hit good enough precision/recall levels to blindly trust the output, at least one crowd worker looks at the output of each automated extraction. In this scenario, there is still benefit to a crowd-machine hybrid: because crowd output takes the same form as the output of the automated extraction, the disclosed extraction techniques can learn from crowd relabeling. As they improve, the system requires less crowd work for high-quality results. This active learning loop applies to any data processing task with iteratively improvable output: one can train a learning algorithm on the output of a reviewed task, and use the model to classify future tasks before humans process them in order to reduce manual worker effort.
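One possible form of this active learning loop is sketched below, using a scikit-learn random forest as a stand-in line classifier; the featurize function, label set, and model choice are assumptions for illustration rather than the disclosed classifiers:

```python
from sklearn.ensemble import RandomForestClassifier

def retrain_line_classifier(reviewed_lines, featurize):
    """Refit the line classifier on crowd-reviewed output.

    `reviewed_lines` yields (line_text, label) pairs, where the label is the
    reviewed classification (e.g., section, item name, price, description).
    `featurize` maps a line of text to a numeric feature vector."""
    X = [featurize(text) for text, _ in reviewed_lines]
    y = [label for _, label in reviewed_lines]
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X, y)
    return model

def preannotate(model, task_lines, featurize):
    # Pre-label lines of a new task so workers correct mistakes rather
    # than typing an entire hard-to-extract menu from scratch.
    predictions = model.predict([featurize(text) for text in task_lines])
    return list(zip(task_lines, predictions))
```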
Once the initial hierarchy has been trained and assembled, growing the hierarchy or adapting it to new macrotask types is efficient. Managers streamline the development of training materials, and although new workers require time to absorb documentation and work through examples, this training time is significantly lower than the costs associated with the traditional freelance knowledge worker hiring process.
The TaskGrader uses a variety of data collected on workers as features for model training. Table 1 describes and categorizes the features used. These features may be categorized into two groupings: task-specific versus worker-specific features, and generalizable versus domain-specific features.
In this section, we evaluate the impact of the techniques proposed above on reducing error in macrotasks and investigate whether these techniques can generalize to other applications. We base our evaluations on a crowd workflow that has handled over half a million hours of human contributions, primarily for the purpose of doing large-scale structured web data extraction. We show that reviewers improve most tasks they touch, and that workers higher in the hierarchy spend less time on each task. We find that the TaskGrader focuses reviews on tasks with considerably more errors than random spot-checking. We then train the TaskGrader on varying subsets of its features and show that domain-independent (and thus generalizable) features are sufficient to significantly improve the workflow's data quality, supporting the hypothesis that such a model can add value to any macrotask crowd workflow with basic logging of worker activity. We additionally show that at constrained review budgets, combining the TaskGrader and a multilayer review hierarchy uncovers more errors than simply reviewing more tasks in single-level review. Finally, we show that a second phase of review often catches errors in a different set of tasks than the first phase.
We have developed a trained crowd of approximately 300 workers, which has spiked to almost 1,000 workers at various times to handle increased throughput demands. Currently, the crowd's composition is approximately 78% DES, 12% Reviewers, and 10% top-tier Reviewers. Top-tier Reviewers can review anyone's output, but typically review the work of other Reviewers to ensure full accountability. The Manager sends 5-10 emails a day to workers with specific issues in their work, such as spelling/syntax errors or incorrect content. He also responds to 10-20 emails a day from workers with various questions and comments.
The throughput of the system varies drastically in response to business objectives. The 90th percentile week saw 19,000 tasks completed, and the 99th percentile week saw 33,000 tasks completed, not all of which were structured data extraction tasks. Tasks are generally completed within a few hours, and 75% of all tasks are completed within 24 hours.
We evaluate our techniques on an industry deployment of Argonaut, in the context of the complex price list structuring task described above. The crowd forming the hierarchy is also described above. The training data consisted of a subset of approximately 60,000 price list-structuring tasks that had been spot-checked by Reviewers over a fixed period. Most tasks corresponded to a business, and the worker was expected to extract all of the price lists for that business. The task error score distribution is heavily skewed toward 0: 62% of tasks have an error score less than 0.025. If the TaskGrader could predict these scores, we could decrease review budgets without affecting output quality. 27% of the tasks contain no price lists and result in empty output. This happens if, for example, the task links to a website that does not exist or does not contain any price lists. For these tasks, the error score is usually either 0 or 1, meaning the worker either correctly identified that the task is empty or did not.
We evaluate the effectiveness of review in several ways, starting with expert coding. Two authors looked at a random sample of 50 tasks each that had changed by more than 5% in their first review. The authors were presented with the pre-review and post-review output in a randomized order so that they could not tell which was which. For each task, the authors identified which version of the task, if any, was of higher quality. The two sets of 50 tasks overlapped by 25 each, so that we could measure agreement rates between authors, and resulted in 75 unique tasks for evaluation.
For the 25 tasks on which authors overlapped, two were discarded because the website was no longer accessible. Of the remaining 23 tasks, authors agreed on 21 of them, with one author marking the remaining 2 as indistinguishable in quality. Given that authors agreed on all of the tasks on which they were certain, we find that expert task quality coding can be a high-agreement activity.
Table 2 summarizes the results of this expert coding experiment. Of the 75 tasks, 4 were discarded for technical reasons (e.g., website down). Of the remaining 71, the authors found 13 to not be discernibly different in either version. On 51 of the tasks, the authors agreed that the reviewed version was higher quality (though they were blind to which version had been reviewed when making their choice). On our data, thresholded at ≥5% of lines changed, review decreases quality 9.9% of the time, does not discernibly change quality 18.3% of the time, and improves quality 71.8% of the time. These findings point toward the key benefit of the hierarchy: when a single review phase causes a measurable change in a task, it improves output with high probability.
Since task quality varies, it is important for the TaskGrader to identify the lowest-quality tasks for review. We trained the TaskGrader, a gradient boosting regression model, on 90% of the data as a training set, holding out 10% as a test set. We compared gradient boosting regression to several models, including support vector machines, linear regression, and random forests, and used cross-validation on the training set to identify the best model type. We also used the training set to perform a grid search to set hyperparameters for our models.
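A minimal version of this training procedure, using scikit-learn's GradientBoostingRegressor with an illustrative (not production-tuned) hyperparameter grid, might look like the following:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

def train_task_grader(X, y):
    """Fit a gradient boosting regressor to predict task error scores.

    X: per-task feature matrix; y: observed error scores from review.
    The grid below is illustrative, not the tuned production values."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.10, random_state=0)
    grid = {
        "n_estimators": [100, 300],
        "max_depth": [2, 3, 4],
        "learning_rate": [0.05, 0.1],
    }
    search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                          grid, cv=5, scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)
    model = search.best_estimator_
    holdout_score = model.score(X_test, y_test)   # R^2 on the held-out 10%
    return model, holdout_score
```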
We evaluate the TaskGrader by the aggregate errors it helps us catch at different review budgets. To capture this notion, we compute the errors caught (represented by the percentage of lines changed in review) by reviewing the tasks identified by the TaskGrader. We compare these to the errors caught by reviewing a random sample of N percent of tasks.
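This comparison can be sketched as follows, where the errors caught by a review policy are approximated by the sum of per-task change fractions over the tasks it selects; the function and inputs are illustrative:

```python
import numpy as np

def errors_caught(pred_scores, review_changes, budget):
    """Fraction of total errors captured when reviewing the top `budget`
    share of tasks ranked by predicted error, versus a random sample of
    the same size. `review_changes` is the percent of lines changed when
    each task was actually reviewed (the proxy for errors caught)."""
    pred_scores = np.asarray(pred_scores, dtype=float)
    review_changes = np.asarray(review_changes, dtype=float)
    k = int(len(review_changes) * budget)
    top_k = np.argsort(pred_scores)[::-1][:k]
    rand_k = np.random.choice(len(review_changes), size=k, replace=False)
    total = review_changes.sum()
    return review_changes[top_k].sum() / total, review_changes[rand_k].sum() / total
```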
We now simultaneously explore which features are most predictive of task error and whether the model might generalize to other problem areas. As previously discussed, we broke the features used to train the TaskGrader into two groupings: task-specific vs worker-specific, and generalizable vs. domain-specific. We now study how these groupings affect model performance.
Generalizable features perform comparably to domain-specific ones. Because features unrelated to structured data extraction are still predictive of task error, it is likely that the TaskGrader model can be implemented easily in other macrotask scenarios without losing significant predictive power.
For our application, it is also interesting to note that task-specific features, such as work time and percent of input changed, outperform worker-specific features, such as mean error on past tasks. This finding is counter to the conventional wisdom on microtasks, where the primary approaches to quality control rely on identifying and compensating for poorly-performing workers. There could be several reasons for this difference: 1) over time, our incentive systems have biased poorly performing workers away from the platform, dampening the signal of individual worker performance, and 2) there is high variability in macrotask difficulty, so worker-specific features do not capture these effects as well as task-specific ones.
The TaskGrader is applied at each level of the hierarchy to determine if the task should be sent to the next level.
We also examined how the amount of error caught would change if we split our budget between Review 1 and Review 2, using the TaskGrader to help us judge if we should review a new task (Review 1), or review a previously reviewed task (Review 2). This approach might catch more errors by reviewing the worst tasks multiple times and not reviewing the best tasks at all.
Examining the figure, we see that for a given budget, there is an optimal trade-off between level 1 and level 2 review. Table 3 shows the optimal percentage of tasks to review twice along with the improvement over random review at each budget. As the review budget decreases, the benefit of TaskGrader-suggested reviews becomes more pronounced, yielding a full 118% improvement over random at a 20% budget. It is also worth noting that with a random selection strategy, there is no benefit to second-level review: on average, randomly selecting tasks for a second review will catch fewer errors than simply reviewing a new task for the first time.
Next we examine in more detail what is being changed by the two phases of review. We measure whether reviewers are editing the same tasks and how correlated the magnitudes of the Review 1 and Review 2 changes are.
In order to measure the overlap between the most changed tasks in the two phases of review, we start with a set of 39,180 tasks that were reviewed twice. If we look at the 20% (approx. 7840) most changed tasks in Review 1 and the 20% most changed tasks in Review 2, the two sets of tasks overlap by around 25% (approx. 1960). We leave out the full results due to space restrictions, but this trend continues in that the most changed tasks in each phase of review do not meaningfully overlap until we look at the 75% most changed tasks in each phase. This suggests that Review 2 errors are mostly caught in tasks that were not heavily corrected in Review 1.
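The overlap computation may be sketched as follows, given per-task change fractions for the two review phases; the data structures are hypothetical:

```python
def top_changed_overlap(review1_change, review2_change, frac=0.20):
    """Overlap between the most-changed tasks in two review phases.
    Inputs map task id -> fraction of lines changed in that phase,
    over tasks that were reviewed twice."""
    k = int(len(review1_change) * frac)
    top1 = set(sorted(review1_change, key=review1_change.get, reverse=True)[:k])
    top2 = set(sorted(review2_change, key=review2_change.get, reverse=True)[:k])
    return len(top1 & top2) / k if k else 0.0

# With ~39,180 twice-reviewed tasks and frac=0.20 (~7,840 tasks per set),
# an overlap of roughly 25% corresponds to ~1,960 shared tasks.
```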
As another measure of the relationship between Review 1 and Review 2, we measure the correlation between the percentage of changes to a task in each review phase. The Pearson's correlation, which ranges from −1 (completely inverted correlation) to 1 (completely positive correlation), with 0 representing no correlation, was 0.096. To avoid making distribution assumptions about our data, we also measured the nonparametric Spearman's rank correlation and found it to be 0.176. Both effects were significant with a two-tailed p-value of p < 0.0001. In both cases, we find a very weak positive correlation between the two phases of review, which suggests that while Review 1 and Review 2 might correct some of the same errors, they largely catch errors on different tasks.
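These correlations can be reproduced with standard library routines; the following sketch assumes the per-task change percentages for the two phases are aligned by task:

```python
from scipy.stats import pearsonr, spearmanr

def review_phase_correlation(review1_change, review2_change):
    """Correlate per-task change percentages across the two review phases.
    Both inputs are equal-length sequences aligned by task."""
    pearson_r, pearson_p = pearsonr(review1_change, review2_change)
    spearman_r, spearman_p = spearmanr(review1_change, review2_change)
    return {"pearson": (pearson_r, pearson_p),
            "spearman": (spearman_r, spearman_p)}
```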
These findings support the hierarchical review model in an unintuitive way. Because we know review generally improves tasks, it is interesting to see two serial review phases catching errors on different tasks. This suggests some natural and exciting follow-on work. First, because Review 2 reviewers are generally higher-ranked, are they simply more adept at catching more challenging errors? Second, are the classes of errors that are caught in the two phases of review fundamentally different in some way? Finally, can the lack of overlap be explained by a phenomenon such as “falling asleep at the wheel,” where reviewer attention decreases over the course of a sitting, and subsequent review phases simply provide more eyes and attention? Studying deeper review hierarchies and classifying error types will be interesting future work to help answer these questions.
Our results show that in crowd workflows built around macrotasks, a worker hierarchy, predictive modeling to allocate reviewing resources, and a model of worker performance can effectively reduce error in task output. As the budget available to spend on task review decreases, these techniques are both more important and more effective, combining to provide up to 118% improvement in errors caught over random spot-checking. While our feature set included a mix of domain-specific and generalizable features, using only the generalizable features resulted in a model that still had significant predictive power, suggesting that the Argonaut hierarchy and TaskGrader model can easily be trained in other macrotask settings without much task-specific featurization. The approaches that we present in this paper are used at scale in industry, where our production implementation significantly improves data quality in a crowd work system that has handled millions of tasks and utilized over half a million hours of worker participation.
This application claims priority to provisional application No. 62/212,989 filed on Sep. 1, 2015.