7 Steps to Crowdsourcing Data for Machine Learning Models

As machine learning (ML) becomes increasingly mainstream, companies and investors are looking to apply it to a wider variety of problems. This is challenging because ML models are inherently problem specific; there is no ML model that is universally applicable. In most cases, every new problem requires the developer to design and train a new ML model specifically tailored to that problem. That said, knowledge gained from solving one problem can often be reused on a similar problem, with only a small training set needed to tune the model’s parameters.

The reasoning for this is relatively intuitive. For a moment, let’s consider how a human baby learns. Imagine she has seen many examples of apples and bananas and is able to identify them well. Now imagine she is presented with a tomato. She would not be expected to correctly identify this new fruit. However, the knowledge that she’s acquired from distinguishing apples from bananas can be useful for identifying the tomato. By learning from a few new samples, the baby would be able to solve the new problem relatively easily.

Returning to the concept of ML models, addressing new problems may require us to start from very little data (similar to the baby) or no data at all. In these cases, how do we benefit from ML models to solve the problem?

Crowdsourcing platforms are here to fill in the gap. By splitting a job into many simple, independent micro-tasks, we can generate, clean, and verify large amounts of data. However, as would be expected, this data may fall short of our normal quality standards. To address this issue, this article will dive into some practical tips for quality control when crowdsourcing ML data.

1. Make each task simple

Simple tasks tend to produce higher-quality results. To take advantage of this, decompose complex tasks into several simple ones. For example, try to avoid instructions similar to “read a 10-page document and answer the following 20 questions”. Instead, break the larger task into micro-work. Using this method, the example above would become several smaller tasks with the following instructions: “read 2 paragraphs and answer 2 questions on the paragraphs.”

Similarly, try to avoid tasks that involve selecting the correct category from a large list of categories. Reframe the task to ask whether the example provided belongs to a specific category. Alternatively, you could use a multi-step process, such as the following:

Partition several categories into groups of categories,
Ask the user to select the right category group, and, lastly,
Ask the user to select the right category within the chosen group.
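
The multi-step process above could be sketched as follows. This is a minimal illustration, not a platform API: the function names, the question dictionaries, and the placeholder group names are all hypothetical, and the categories are grouped simply by position in the list. In practice you would likely group categories by meaning so that workers can recognize the right group at a glance.

```python
from typing import Dict, List


def partition_categories(categories: List[str], group_size: int) -> Dict[str, List[str]]:
    """Step 1: split a flat category list into groups of at most group_size."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    groups: Dict[str, List[str]] = {}
    for start in range(0, len(categories), group_size):
        name = f"group_{start // group_size + 1}"  # hypothetical placeholder name
        groups[name] = categories[start:start + group_size]
    return groups


def group_question(item: str, groups: Dict[str, List[str]]) -> dict:
    """Step 2: ask the worker which category group the item belongs to."""
    return {
        "item": item,
        "prompt": "Which group fits this item?",
        "options": list(groups),
    }


def category_question(item: str, groups: Dict[str, List[str]], chosen_group: str) -> dict:
    """Step 3: ask the worker to pick a category within the chosen group."""
    return {
        "item": item,
        "prompt": "Which category fits this item?",
        "options": list(groups[chosen_group]),
    }


# Example usage with made-up categories and a made-up item identifier.
groups = partition_categories(["apple", "banana", "tomato", "carrot", "onion"], group_size=2)
first = group_question("photo_001", groups)
second = category_question("photo_001", groups, chosen_group="group_2")
```

Each question now offers only a handful of options, which keeps every individual decision small even when the full category list is long.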

2. Provide clear instructions

Clearly and concisely articulate the task description and any exceptions. If possible, provide good and bad examples to illustrate the expected answers. It’s also helpful to balance the length of the instructions against the task complexity and the total number of assignments. For example, a worker is unlikely to spend 10 minutes reading instructions for a 5-minute task, or for a job that consists of only two assignments paying $0.50 each.

3. Assign the tasks to a specific group of workers

Some crowdsourcing platforms collect workers’ personal information, such as location, language, and age, and provide metrics on worker performance. Check whether this information is available, because restricting a task to workers who match its requirements often improves the quality of results. Note that this option often involves an additional fee, so be aware that you may be increasing your costs.

4. Use qualifying questionnaires or hold short interviews

If the platform does not offer such filters, you can perform your own filtering through qualifying questionnaires or short interviews. However, be aware that the extra step may deter some qualified workers.

5. Verify results by assigning the same task to multiple workers

This may sound redundant, but ensuring you have quality data for your ML model is crucial. To verify the results of simple tasks, 2 workers may be enough. However, consider assigning 3 or more workers if the task is more complex.
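
One common way to combine answers from multiple workers is a majority vote. The sketch below assumes each worker returns a single label per task; the function name and the `min_agreement` parameter are illustrative choices, not part of any particular crowdsourcing platform.

```python
from collections import Counter
from typing import Hashable, List, Optional, Tuple


def aggregate_labels(
    labels: List[Hashable], min_agreement: float = 0.5
) -> Tuple[Optional[Hashable], float]:
    """Return (winning_label, agreement) for one task.

    agreement is the fraction of workers who chose the most common label.
    The label is returned only if agreement is strictly greater than
    min_agreement; otherwise None is returned so the task can be reviewed
    or sent to additional workers.
    """
    if not labels:
        raise ValueError("at least one label is required")
    top_label, top_count = Counter(labels).most_common(1)[0]
    agreement = top_count / len(labels)
    if agreement > min_agreement:
        return top_label, agreement
    return None, agreement


# Three workers, two of whom agree: the majority label is accepted.
label, agreement = aggregate_labels(["cat", "cat", "dog"])

# Two workers who disagree: no label is accepted.
unresolved, tie_agreement = aggregate_labels(["cat", "dog"])
```

With only 2 workers, any disagreement leaves the task unresolved, which is one reason to add a third worker for harder tasks.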

6. Sprinkle in known answers to verify results

Another method to verify the quality of results is to assign tasks for which you already know the answers. For example, sprinkle in some tasks or questions with known results, and only accept a worker’s data if their accuracy on the known questions passes a pre-determined threshold.
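
The known-answer check could look something like the sketch below. It assumes a worker’s submission is a mapping from question ID to answer; the function names, the question IDs, and the default threshold of 0.8 are hypothetical and should be adapted to your task.

```python
from typing import Dict


def known_answer_accuracy(answers: Dict[str, str], known: Dict[str, str]) -> float:
    """Fraction of known-answer questions the worker got right.

    A known question that the worker skipped counts as incorrect.
    """
    if not known:
        raise ValueError("at least one known-answer question is required")
    correct = sum(
        1 for question, expected in known.items()
        if question in answers and answers[question] == expected
    )
    return correct / len(known)


def accept_submission(
    answers: Dict[str, str], known: Dict[str, str], threshold: float = 0.8
) -> bool:
    """Accept the whole submission only if accuracy meets the threshold."""
    return known_answer_accuracy(answers, known) >= threshold


# Made-up example: q1 and q2 have known answers, q3 is a real task.
known = {"q1": "yes", "q2": "no"}
submission = {"q1": "yes", "q2": "yes", "q3": "maybe"}
accuracy = known_answer_accuracy(submission, known)
accepted = accept_submission(submission, known)
```

Only the known questions are scored; the worker’s answers to the real tasks (here `q3`) are kept or discarded together based on that score.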

7. Do in-house verification

Don’t collect all the data that you need at one time. Begin by collecting it in a small batch. This will allow you to verify if and how the results deviate from expectations. Adjust the data collection strategy accordingly, and then continue.
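
As a rough sketch of this workflow, you might set aside a small pilot batch, label part of it in-house, and measure how often the crowd disagrees with you before sending out the rest. The function names and the data layout (item ID mapped to label) are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def split_pilot(items: Sequence[str], pilot_size: int) -> Tuple[List[str], List[str]]:
    """Split items into a small pilot batch and the remainder."""
    if pilot_size < 1:
        raise ValueError("pilot_size must be at least 1")
    return list(items[:pilot_size]), list(items[pilot_size:])


def disagreement_rate(crowd: Dict[str, str], in_house: Dict[str, str]) -> float:
    """Fraction of in-house checked items where the crowd label differs."""
    checked = [item for item in in_house if item in crowd]
    if not checked:
        raise ValueError("no overlap between crowd and in-house labels")
    mismatches = sum(1 for item in checked if crowd[item] != in_house[item])
    return mismatches / len(checked)


# Made-up example: five items, the first two form the pilot batch.
pilot, remaining = split_pilot(["a", "b", "c", "d", "e"], pilot_size=2)
crowd_labels = {"a": "cat", "b": "dog"}
in_house_labels = {"a": "cat", "b": "cat"}
rate = disagreement_rate(crowd_labels, in_house_labels)
```

A high disagreement rate on the pilot batch is a signal to revisit the instructions or task design before spending money on the remaining items.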

A Final Thought

Crowdsourced data can be costly, especially in large volumes. When considering crowdsourcing, always check whether there are other tools or existing sources of data you can use to reduce the amount of data you need to crowdsource. For example, others may be trying to solve a similar problem or may have already collected data. Consider sharing data with them or buying it from them. And don’t forget web crawlers and automatic ML annotators. You can use these tools to do part of the job and reduce the workload passed to crowd workers.