March 25, 2016
Twitter is one of the platforms identified as a data source within the SSIX project. Let's look at the different techniques that can be used to retrieve this data and the difficulties that come with them.
SSIX decided to adopt Twitter as a primary source of information for spotting financial trends and calculating the indices that will form the core of the final platform. The statistical relevance of analysis performed on this social network has been repeatedly confirmed by many different studies.
Public information can be collected from Twitter using two main access points: the REST APIs and the Streaming APIs. The REST APIs provide programmatic access to read and write Twitter data. Specifically, the Search API (https://dev.twitter.com/rest/public/search), which is part of Twitter's REST API, can be used to retrieve tweets related to specific topics. This API searches against a sampling of recent Tweets published in the past 7 days, and it's important to know that the Search API is focused on relevance, not completeness.
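As a minimal sketch, a Search API request URL can be assembled as below. The endpoint path is the one documented above; authentication (an OAuth-signed request) is omitted, and the helper name is illustrative:

```python
from urllib.parse import urlencode

# Search API endpoint (part of Twitter's REST API, v1.1 at the time of writing)
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def build_search_url(query, count=100, result_type="recent"):
    """Assemble a search/tweets request URL.

    The actual request must be OAuth-signed; that step is omitted here.
    """
    params = {"q": query, "count": count, "result_type": result_type}
    return SEARCH_URL + "?" + urlencode(params)

# Example: recent tweets mentioning the $AAPL cashtag
url = build_search_url("$AAPL")
```

Note that `result_type="recent"` asks for the latest tweets rather than the "top" (relevance-ranked) ones, which matters given the relevance-over-completeness caveat above.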
On the other hand, the public Twitter streams (https://dev.twitter.com/streaming/public) offer samples of the public data flowing through Twitter. In our case, we adopted the technique of collecting data directly from the public streams, in particular using the endpoint statuses/filter (https://dev.twitter.com/streaming/reference/post/statuses/filter), which returns public statuses matching one or more filter predicates. This allows intercepting in real time the content being published on Twitter.
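A minimal sketch of a statuses/filter client, assuming the third-party requests library and an OAuth1 credentials object (e.g. from requests_oauthlib) obtained for a registered Twitter app; function names here are illustrative:

```python
import json

# Streaming endpoint for filtered public statuses
FILTER_URL = "https://stream.twitter.com/1.1/statuses/filter.json"

def build_filter_body(keywords):
    """POST body for statuses/filter: a comma-separated track parameter."""
    return {"track": ",".join(keywords)}

def stream_tweets(auth, keywords):
    """Yield decoded tweets from the filtered public stream.

    `auth` is assumed to be a requests-compatible OAuth1 object; the
    import is kept local so the pure helper above stays testable offline.
    """
    import requests  # third-party dependency, assumed installed
    resp = requests.post(
        FILTER_URL,
        data=build_filter_body(keywords),
        auth=auth,
        stream=True,  # keep the connection open and read line by line
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # the stream sends blank keep-alive lines
            yield json.loads(line)
```

Each non-empty line on the stream is a self-contained JSON document, which is why decoding happens per line rather than on the whole response.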
While the documentation is well written and complete, there are many technical aspects and issues that need to be tackled in order to set up a stable data ingestion environment. First of all, there is a known limit on the number of track keywords that can be used to open a stream: 400. When more keywords need to be tracked, multiple parallel servers and applications must be configured to avoid being blocked by Twitter.
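A simple way to respect the 400-keyword limit is to split the keyword list into chunks, one per parallel stream connection. A sketch:

```python
TRACK_LIMIT = 400  # documented maximum track keywords per stream connection

def split_tracks(keywords, limit=TRACK_LIMIT):
    """Split a keyword list into chunks that each fit in one connection."""
    return [keywords[i:i + limit] for i in range(0, len(keywords), limit)]

# 1000 keywords -> 3 connections (400 + 400 + 200)
chunks = split_tracks(["kw%d" % i for i in range(1000)])
```

Each chunk would then be handed to a separate client (on a separate host or app, as noted above), since opening several streams from one account/IP is itself a way to get rate-limited.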
Secondly, the software connecting to these streams must be well engineered and structured enough to deal with potential disconnections, hiccups or overflows caused by Twitter. For instance, the clients reading the data must consume their queues in a reasonable time, otherwise Twitter can drop the connection without notice to avoid slowdowns on its side.
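Two pieces help here: a reconnect policy with exponential backoff (the 5 s doubling capped at 320 s below follows Twitter's documented reconnect guidance for HTTP errors) and a bounded in-memory queue that decouples the socket reader from slower downstream processing. A sketch, with illustrative names:

```python
import queue

def http_backoff(attempt, base=5.0, cap=320.0):
    """Seconds to wait before reconnect attempt N: 5, 10, 20, ... up to 320."""
    return min(base * (2 ** attempt), cap)

# Bounded hand-off queue: the reader thread puts raw tweets here and a
# separate worker consumes them, so slow processing never stalls the socket.
tweet_queue = queue.Queue(maxsize=10000)

def enqueue(tweet):
    """Non-blocking put; drop the tweet rather than stall the stream reader."""
    try:
        tweet_queue.put_nowait(tweet)
        return True
    except queue.Full:
        return False  # better to lose one sampled tweet than the connection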
It is also really important to be aware that not all the content published on Twitter passes through these streaming APIs: only a small sample of the whole dataset is delivered. For instance, with a very popular track keyword (e.g. “sex” or “apple”) we receive only a small percentage of the matching content posted on Twitter; otherwise the quantity of data would be too large to manage.
Access to historical data is another issue: it cannot be obtained through the standard API access and can only be bought from third parties with a direct partnership with Twitter, like DataSift or Gnip. These companies have access to the full Twitter firehose and are able to provide data going all the way back to 2006. For this reason they have always been under the spotlight and have been targets of acquisitions by big players: Gnip was acquired by Twitter in 2014 for $134 million, while another strategic platform, Topsy, was acquired by Apple in 2013 and shut down in 2015.
3rdPLACE shared its expertise and knowledge on this topic, providing the technologies used within the SSIX project to collect and manipulate the different data sources available on the web.