Crowdsourcing is a problem-solving approach that uses the power of the crowd to perform certain tasks. Usually a task is divided into small parts, also known as 'microtasks', which are processed by one or more individual crowdworkers. Crowdsourcing is used in various areas of research such as linguistics, astronomy and genetic genealogy. In recent years it has also gained great popularity for gathering training data for machine learning models.
We consider crowdsourcing a promising way for the SSIX project to collect training data for sentiment analysis and named entity recognition.
In a series of four posts we will describe how crowdsourcing works, compare some commercial platforms and demonstrate how open-source platforms can be used for a test run.
How to set up Crowdsourcing – what to keep in mind
In order to create an easy, understandable and straightforward crowdsourcing job, there are a few things you have to keep in mind. We have summed up the most important steps and explain each of them, using the task of sentiment annotation as a running example:
- Define the job: Try to be clear about what data you actually want to collect. Is it reasonable to outsource this work to crowdworkers? Also think about how implementable and understandable the user interface will be. In the case of our example project, we started by sketching a user interface.
- Describe the job: Here you should give the user a step-by-step description of how to process the job correctly. A short job description could start similar to the following: “In this job you will be presented with tweets concerning companies in the stock market. Review the tweets to determine their sentiment regarding the company’s stock…”.
- Clarity of job description: Is the job description understandable? If yes, skip the next step. If no, proceed. In our case we considered the job description to be understandable and thus skipped the next step.
- Split the job into smaller separate jobs: If your job description seems hard to understand, try to split the job into smaller pieces and come up with separate tasks. As a consequence, one job may depend on others, which makes the whole process a bit slower. This is still preferable to overloading crowdworkers with highly complex tasks, since the quality of the results may suffer.
- Training data for crowdworkers: To give the crowdworkers a basic idea of what they should do, it is good practice to present them with training examples. After working through these examples they should be confident in their task. We added pre-annotated tweets to the description to serve as examples.
- Testing data for crowdworkers: As soon as the workers have completed the training and read the description, you may want to confirm, in the form of a test, that they are able to process your job successfully. If they do not pass, they are not allowed to contribute. This is important for getting good-quality results, which we will describe in detail later. Our crowdworkers had to answer 8 test questions before being able to contribute to the job.
- Settings: There are a few more things you have to define before you can actually run the job.
- The number of people who should answer each question: We decided to have each tweet annotated by 10 people. This is quite a lot, but we wanted to get a better feeling for this kind of job.
- How you want to aggregate the results: We chose to aggregate by confidence which the platform calculated based on several indicators.
- The percentage of correct answers people need to pass your test: Our workers needed more than 70% to pass.
- Run your job: Now that you have a well-defined and well-described job, you can start to run it. From now on the crowdworkers are able to contribute if they pass your test.
- Analyze results: After the job is finished, you can start to analyze your results. Are you satisfied with them? If yes, you are done. If no, review each step you took and look for possible mistakes. You may want to run the revised job again. We were mostly satisfied with our results and will present them in the next posts.
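The settings step above (10 judgments per tweet, aggregation by agreement, a pass threshold) can be sketched in code. This is a minimal illustration with names and thresholds of our own choosing, not the formula any particular platform uses:

```python
from collections import Counter

def aggregate(judgments, min_confidence=0.7):
    """Aggregate the sentiment labels collected for one tweet.

    Keeps the majority label together with a simple confidence score
    (the fraction of workers who agree); tweets below the confidence
    threshold are flagged for review rather than trusted.
    """
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    confidence = votes / len(judgments)
    if confidence < min_confidence:
        return None, confidence  # low agreement: no reliable label
    return label, confidence

print(aggregate(["positive"] * 8 + ["negative"] * 2))  # ('positive', 0.8)
print(aggregate(["positive"] * 5 + ["negative"] * 5))  # (None, 0.5)
```

A real platform computes confidence from several indicators (including worker trust), but majority agreement is the simplest version of the idea.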
It’s all about quality control
As mentioned, getting good-quality data from crowdworkers can be problematic. Many factors may influence their work and thus degrade quality. If your workers are paid per task, they tend to work fast without giving much thought to their answers. Your task may also require special expertise; for example, sentiment annotation of StockTwits messages requires basic knowledge of stock market terminology. What you want is a crowd that is trustworthy in every way. We try to keep this blog post generic, but while reading it you can think of sentiment annotation of social media messages as an example crowdsourcing job.
In the following we present some basic guidelines and rules for tackling these quality problems, especially when using a non-expert, paid crowd:
- Description: It is very important to present a detailed description of your task. This should include a general introduction to the topic, as well as a how-to for the chosen user interface. Your task description should further contain good examples, e.g. for sentiment annotation you should have at least one example representing each selectable answer.
- Entrance test: You want to restrict your crowd to those people who are actually able to understand the task at hand. Carefully put together a quiz that people have to pass before they can start working on the job. The quiz is not meant to be easy, but it should be fair and doable. When presenting multiple selectable answers, you may want to accept more than one answer as correct, e.g. a range or all negative sentiment options. It can also be a good idea to allow a certain percentage of incorrect answers, since some answers can be very subjective or your job description could have flaws. All workers that pass this entrance test are, for the time being, considered trustworthy.
- Maintaining trust: Merely passing the entrance test is certainly not enough to get good results. If paid workers knew that they only had to pass a quiz and would remain unsupervised afterwards, they might put little to no effort into giving good answers. To prevent this, you need to randomly mix in pre-annotated test questions. These can be annotated manually by yourself and people you trust; for future jobs you could gather answers with high agreement from previous tasks. Workers can be assigned a trust level which rises on correct test answers and falls on incorrect ones. If a worker falls below a certain trust threshold, they are barred from continuing to work on your task. Keep in mind that monitoring worker trust plays an essential role in getting good results.
- Quality vs. redundancy: Another important question you may want to ask yourself is: how many annotations from different workers should each message get? For many less obvious questions, a single worker is very likely not enough to get a reliable answer. You may want to aggregate the answers by mean or majority, or filter out questions with low worker agreement. Of course this is a trade-off between cost and quality, and the number of answers needed can vary strongly from question to question. One possible way to tackle this can be seen on the crowdsourcing platform Crowdflower: it supports dynamic judgments, so that questions with low worker agreement are answered more often than ones with high agreement.
- Feedback: Crowdworkers may not be experts in your particular task, but they tend to be quite experienced in performing similar microtasks. Hence they know very well what a good job description and entrance test should look like. Give them the opportunity to provide proper feedback. Additionally, it can be a good idea to have some kind of “I don’t know” option on each question that requires a comment from the worker. This helps to identify difficult questions and to understand why they are so difficult.
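The trust mechanism described above (trust rises on correct hidden test answers, falls on incorrect ones, and workers below a threshold are excluded) can be sketched as follows. The update rule and the 0.7 threshold are illustrative assumptions on our part, not any platform's actual formula:

```python
class WorkerTrust:
    """Tracks one worker's accuracy on hidden test questions."""

    def __init__(self, threshold=0.7):
        self.correct = 0
        self.total = 0
        self.threshold = threshold

    def record_test_answer(self, is_correct):
        # Called whenever the worker answers a pre-annotated test question.
        self.total += 1
        if is_correct:
            self.correct += 1

    @property
    def trust(self):
        # New workers start fully trusted; trust is then the observed accuracy.
        return self.correct / self.total if self.total else 1.0

    @property
    def allowed(self):
        # Workers below the trust threshold are barred from the job.
        return self.trust >= self.threshold

worker = WorkerTrust()
for answer in [True, True, False, True, False, False]:
    worker.record_test_answer(answer)
print(worker.trust, worker.allowed)  # 0.5 False
```

A production system would likely weight recent answers more heavily, but the principle is the same: every hidden test answer updates the trust level, and the level gates participation.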
As the job author, you should give workers feedback on incorrectly answered test questions so that they can reflect on their mistakes or even contest your answer in the form of a comment. It may take some attempts and time until a crowdsourcing job runs smoothly, but in most cases it is possible to get high-quality results even on difficult tasks.
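The dynamic-judgment idea from the quality-vs-redundancy point above can also be sketched: keep collecting judgments for a question until worker agreement is high enough, up to a maximum. Here `get_judgment` is a placeholder standing in for asking one more crowdworker, and the parameter values are assumptions for illustration:

```python
from collections import Counter

def collect_judgments(get_judgment, min_judgments=3, max_judgments=10,
                      target_agreement=0.8):
    """Collect judgments for one question until agreement is reached.

    Questions with high agreement stop early (saving cost); contested
    questions keep collecting answers up to max_judgments.
    """
    judgments = [get_judgment() for _ in range(min_judgments)]
    while len(judgments) < max_judgments:
        top_votes = Counter(judgments).most_common(1)[0][1]
        if top_votes / len(judgments) >= target_agreement:
            break  # agreement is high enough: stop paying for more answers
        judgments.append(get_judgment())
    return judgments
```

With unanimous workers this stops after the minimum three judgments; with workers who disagree evenly it runs up to the maximum of ten.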
In the next blog posts of this series we will present a comparison of existing crowdsourcing platforms. They will be followed by a report of a small example job we ran on Crowdflower, in which we asked users to annotate the sentiment of StockTwits messages.
This blog post was written by SSIX partners Alfonso Noriega, Sebastian Strumegger, Sophie Reischl at Redlink GmbH.
For the latest updates, like us on Facebook, follow us on Twitter and join us on LinkedIn.