February 29, 2016
Filtering out irrelevant data before sampling
Pre-sampling filtering aims to discard the least relevant data elements. One approach is to remove unwanted spam content generated by spambots on social networks; for example, text data that contains no emoticons can be discarded, since social media robots rarely use them. An alternative strategy is to remove duplicates and to exclude data elements that contain only a URL; a data sample that contains text in addition to a URL is considered acceptable, as the text provides further context and is more valuable for sentiment analysis. Moreover, by allowing only one atomic piece of social media data per user per time unit, it is possible to prevent a single user’s viewpoint from skewing the results by dominating the sentiment analysis outputs for a given time unit. Limiting data samples to one per user per time unit will also dramatically reduce the influence of spambot messages.
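The duplicate, URL-only, and one-post-per-user-per-time-unit rules above can be sketched as a simple pipeline. The field names (`user`, `text`, `timestamp`) and the per-hour time unit are illustrative assumptions, not a fixed schema:

```python
import re
from datetime import datetime

URL_RE = re.compile(r"https?://\S+")

def filter_posts(posts, time_unit="%Y-%m-%d %H"):
    """Discard duplicates, URL-only posts, and extra posts per user per time unit.

    `posts` is an iterable of dicts with `user`, `text`, and `timestamp`
    (a datetime) keys; `time_unit` is a strftime pattern defining the
    time bucket (here: one hour).
    """
    seen_texts = set()
    seen_user_slots = set()
    kept = []
    for post in posts:
        text = post["text"].strip()
        # Remove exact duplicates.
        if text in seen_texts:
            continue
        # Exclude posts that contain only a URL and no other text.
        if not URL_RE.sub("", text).strip():
            continue
        # Allow one post per user per time unit.
        slot = (post["user"], post["timestamp"].strftime(time_unit))
        if slot in seen_user_slots:
            continue
        seen_texts.add(text)
        seen_user_slots.add(slot)
        kept.append(post)
    return kept
```

Each rule is a cheap set lookup, so the whole filter runs in a single pass over the input stream.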
Domain-specific social networks have their own particular issues which a filtering workflow needs to take into account. StockTwits, a social media platform designed for sharing ideas between investors and traders, is an ideal source for gauging public sentiment towards financial securities. However, it suffers from spam techniques specifically designed to manipulate the apparent sentiment around a company or asset on the platform. Pump-and-dump scams, the fraudulent practice of encouraging investors to buy shares in a company in order to inflate the price artificially, have migrated from their prior circulation channels of email and message-board spam to social networks. Social networks are also fertile ground for stock hoaxes, with accounts impersonating prominent market players spreading negative commentary that can cause declines in a share’s value. The domain-specific filtering workflow needs to detect these forms of manipulation and to apply custom filtering rules; for example, content on StockTwits would only be considered if exactly one cashtag is mentioned and clear sentiment is expressed in the message.
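The single-cashtag rule can be sketched with a regular expression; the cashtag pattern and the sentiment labels below (e.g. the platform’s bullish/bearish tags, or a label from an upstream classifier) are illustrative assumptions:

```python
import re

# Illustrative cashtag pattern: "$" followed by a short ticker symbol.
CASHTAG_RE = re.compile(r"\$[A-Za-z]{1,6}(?:\.[A-Za-z]{1,2})?")

def accept_stocktwits_message(text, sentiment_label):
    """Keep a message only if it mentions exactly one distinct cashtag
    and carries a clear sentiment label."""
    cashtags = {t.upper() for t in CASHTAG_RE.findall(text)}
    return len(cashtags) == 1 and sentiment_label in ("bullish", "bearish")
```

Messages that mention several tickers, or express no clear sentiment, are dropped so that each retained message maps unambiguously to one security.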
Exploring machine learning approaches to filtering is another possible area the consortium is interested in investigating.
Once the system has cleaned the input data and discarded irrelevant elements, data sampling is applied.
Sampling Data for Training
Data Sampling Method
A hybrid sampling approach that combines stratified random sampling with a simple randomized selection of source data may generate more representative data than a simple randomized selection method alone.
“A stratified random sample is obtained by separating the population into mutually exclusive sets, or strata, and then drawing simple random samples from each stratum. After the population has been stratified, we can use simple random sampling to generate the complete sample.”
Implementations can use just one stratification parameter; a time unit is a basic example (Figure 1).
Figure 1. Stratified Random Sample
Alternatively, several stratification parameters can be combined, such as time unit and stock; this would be useful if the data were sourced from StockTwits. The idea is to gather random samples by stock per day (RS-Stock1day, RS-Stock2day, RS-Stockn day, etc.), which form a good base for random sampling by stock per week, then per month, and so on. Once the coarsest temporal granularity is reached, random samples by stock per year will be available. These provide a suitable foundation from which to extract random sample data that is representative of the entire population.
In the case of a stock of particular interest, a random sample specific to this stock can be built by taking random samples for this stock per day, then per week and so on.
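The stratification scheme above can be sketched as follows, using (day, stock) pairs as strata and drawing a simple random sample from each; the `day` and `stock` field names and the fixed sampling fraction are illustrative assumptions:

```python
import random
from collections import defaultdict

def stratified_sample(posts, frac, seed=None):
    """Draw a simple random sample of fraction ``frac`` from each stratum,
    where a stratum is a (day, stock) pair.

    `posts` is an iterable of dicts with `day` and `stock` keys
    (illustrative field names). At least one element is kept per
    stratum so no stratum is left unrepresented.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for post in posts:
        strata[(post["day"], post["stock"])].append(post)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * frac))
        sample.extend(rng.sample(members, k))
    return sample
```

Daily samples per stock produced this way can then be pooled to form the weekly, monthly, and yearly samples described above.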
Data Sample Size
Determining the size of the sample for analysis is important. Normally, the larger the sample size, the more representative it is; the issue is that the cost of applying analysis, annotation, or both to the sample grows with its size. A proposed solution is to use a method for determining the data sample size which takes into consideration the population’s size, the margin of error, the confidence level, and the standard deviation.
To experiment with the sample size, simply decrease or increase the confidence level or the margin of error. (http://success.qualtrics.com/rs/qualtrics/images/Determining-Sample-Size.pdf)
As an example, with sample data of size 25,000, the margin of error set at 5%, the confidence level at 99%, and a standard deviation of 0.5, the number of samples requiring annotation is 542, which is 2.1% of the total sample data.
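A common way to compute such a sample size is Cochran’s formula with a finite-population correction, sketched below. The exact figure depends on the z-score used for the chosen confidence level and on whether the finite-population correction is applied, so results may differ slightly from the figure quoted above:

```python
import math

def required_sample_size(population, margin_of_error, z, p=0.5):
    """Cochran's sample-size formula with a finite-population correction.

    ``z`` is the z-score for the chosen confidence level (e.g. 1.96 for
    95%, 2.576 for 99%); ``p`` is the estimated proportion, with 0.5
    being the most conservative choice since it maximises the variance
    p * (1 - p).
    """
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    # The finite-population correction shrinks the requirement
    # when the population itself is small.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)
```

For a population of 25,000 at a 5% margin of error, this gives 379 samples at 95% confidence and 647 at 99% confidence; tightening the margin of error or raising the confidence level increases the requirement.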