Improving sentiment analysis via crowdsourcing: A test run with CrowdFlower, StockTwits, and Pybossa
March 5, 2018
In order to test our assumptions and get a better feeling for the crowdsourcing community, we launched a very small-scale job on CrowdFlower. The job was designed to let the crowd judge the sentiment of tweets from StockTwits with regard to the mentioned company and its stock.
To begin with, we collected 59 tweets about different companies. They were picked by two criteria: 8 tweets with clearly positive or negative sentiment for testing our contributors, and 51 tweets with less obvious sentiment for collecting the real annotation data. The contributors' task was therefore to annotate these 51 tweets.
We then started to build our job by writing a description for the contributors to understand their task. CrowdFlower provides an easy-to-use editor for creating descriptions.
The next step was to design an interface for annotating tweets. The messages were presented with StockTwits' embedding widget. We hid the “Bearish” and “Bullish” flags because expert contributors who voluntarily pre-tested our job told us to remove them. They indicated that those flags would make the task too easy: no worker would take the effort to actually read the message and would instead just mark the average positive/negative sentiment according to the flag.
This is what the result looked like:
Afterwards we defined the job settings and chose which rows would serve as test questions.
Some important settings:
Contributors: Only level 3 contributors were allowed to apply for our job.
Quality: Throughout the task, workers had to stay above 70% accuracy on test questions.
Rows per Page: The minimum work a contributor has to finish in order to get paid is one page. We defined one page to consist of 10 tweets.
Judgments per Row: We defined each of the 51 tweets to need 10 judgments by different workers. For a real job, 10 would probably be far too many and too expensive, but we wanted many opinions on each tweet in order to check the worker agreement manually.
Payment: Contributors were paid $0.02 per judgment. We do not know whether this is too much, but according to the worker feedback we received, our payment was above average.
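The raw judgment cost implied by these settings can be sanity-checked with a quick calculation. This is only a sketch using the numbers above; the gap between the base cost and the final bill is presumably made up of test-question judgments and platform fees:

```python
# Back-of-the-envelope cost estimate for the job described above.
work_rows = 51           # tweets needing annotation
judgments_per_row = 10   # judgments collected per tweet
pay_per_judgment = 0.02  # USD paid per judgment

base_cost = work_rows * judgments_per_row * pay_per_judgment
print(f"Base cost for work rows: ${base_cost:.2f}")  # $10.20
# The final bill of $21.96 also had to cover judgments on test
# questions and the platform's markup.
```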
In less than a day our job was finished, at a total cost of $21.96.
78 contributors passed the entrance test and 178 failed it. Within the first few minutes we started getting feedback from contributors that our test questions might be too difficult or unfair due to the subjectivity of the task. Hence we made small adjustments.
Throughout the job, another 25% of contributors failed to stay above 70% accuracy and were flagged as untrusted users.
Unfortunately, no contributor was able to give more than 20 judgments, since we had too few test questions and each test question may be presented only once per contributor. We would definitely take this into account next time and create more test questions.
The full report on the results is very detailed, including each judgment from each contributor. We have picked important values from the reports and summarized them in a document. Follow this link to see results for all processed tweets: https://goo.gl/wQqp46.
The results look quite promising. For better accuracy it would help to give contributors more options to choose from, or a slider. Some workers seem to pick the average negative or positive value all too often; this strategy may feel safe because it also lets them pass the test questions. So we definitely need more answer options and more test questions covering extreme cases.
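To check worker agreement without reading every row by hand, the per-tweet judgments from the report could be aggregated with a simple majority vote. A minimal sketch, where the label names and the example data are made up for illustration:

```python
from collections import Counter

def aggregate(judgments):
    """Return the majority label and the agreement ratio
    for one tweet's list of judgments."""
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(judgments)

# Hypothetical judgments for one tweet from 10 different workers.
judgments = ["positive", "positive", "neutral", "positive", "negative",
             "positive", "positive", "neutral", "positive", "positive"]
label, agreement = aggregate(judgments)
print(label, agreement)  # positive 0.7
```

A low agreement ratio flags exactly the ambiguous tweets where extra answer options or a slider would help most.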
All in all, we would rate the crowdsourcing approach to gathering data for machine learning as very promising, even for difficult tasks. The keys to high-quality results are the right quality control and good task descriptions.
For the SSIX project we also came up with a more general interface that combines entity recognition and sentiment annotation in one job. The interface was integrated into an open-source crowdsourcing framework called Pybossa.