April 4, 2016
As discussed in a previous post, Social Network data will be one of the main sources of the SSIX Sentiment Analysis platform for Financial domains.
It is well known that the amount of data made available from such sources is huge, and the first challenge is to be able to filter relevant content out of the massive amount of data that can be retrieved. Defining relevance is per se a challenging task, but in the context of the SSIX project it has at least the following facets:
- We want content that pertains to specific domains (finance, economy, user polls about products and companies)
- We want to retrieve data concerning specific products, companies, people, or events
- We want to filter for reliable information (avoiding spam in the first place, but also being able to restrict the search to the opinions of domain experts)
- We want to select a specific locale as the source of the contents (for instance only French or German content)
- We want to see a representative sample of positive and negative discussions
Taking into account all these aspects is essential to providing an effective platform for monitoring financial and economic trends across Social Networks.
Research in the SSIX consortium is currently focusing on the best way to retrieve content related to Companies, Products, People and Events. Identifying such mentions in text is known as Named Entity Recognition (NER), and it is an integral component of Information Retrieval (IR), the task of retrieving content relevant to a given query.
In order to succeed, IR cannot simply rely on matching the surface form of the query string (the way the query is typed) against the content. Different sources may refer to the same entity in different ways, and this is even more pervasive in social network content: besides naming variants that depend on the specific linguistic domain and locale, we are faced with spelling variants, including misspelled names.
For instance, while string-based filtering may work reasonably well with companies like “Microsoft” and “Google”, in many cases, especially with multiword company names, it may discard most of the relevant data, because the full company name is often not used in social network content. In the most extreme case, the expected company name may not appear at all in the relevant data. For example, the Finnish company “Nokian Renkaat” is referred to as “Nokian Tyres” in English data, and the account handles used in tweets are based on the English name (e.g. @NokianTyres, @NokianTyresCom). Finally, everyone has come across mistyped names in social networks.
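To make the problem concrete, here is a minimal Python sketch that compares a naive string filter with a filter that knows a few naming variants. It is only an illustration, not the SSIX implementation: the alias table, the function names and the sample posts are all invented for this example.

```python
# Hypothetical alias table: canonical name -> surface forms seen in text.
# A real system would have to learn or curate these; here they are hard-coded.
ALIASES = {
    "Nokian Renkaat": [
        "Nokian Renkaat",
        "Nokian Tyres",
        "@NokianTyres",
        "@NokianTyresCom",
    ],
}

# Invented sample posts, for illustration only.
POSTS = [
    "Nokian Tyres reports strong winter sales",
    "New studded model from @NokianTyres looks great",
    "Nokian Renkaat publishes its interim report",
    "Microsoft announces a new Surface",
]


def exact_filter(posts, name):
    """Naive string-based filter: keep posts that contain the name as typed."""
    return [p for p in posts if name.lower() in p.lower()]


def alias_filter(posts, canonical, aliases=ALIASES):
    """Keep posts that contain any known surface form of the canonical name."""
    forms = [f.lower() for f in aliases.get(canonical, [canonical])]
    return [p for p in posts if any(f in p.lower() for f in forms)]


naive_hits = exact_filter(POSTS, "Nokian Renkaat")
alias_hits = alias_filter(POSTS, "Nokian Renkaat")
print(len(naive_hits), len(alias_hits))
```

On these sample posts the naive filter keeps only the post that spells out the Finnish name, while the alias-aware filter also keeps the two posts that use the English name or the account handle.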
Mapping naming variants to their canonical form, overcoming language barriers related to the locale, and recognising names despite possible misspellings are the challenges we are currently dealing with in the Named Entity Recognition component of SSIX.
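As a rough illustration of the misspelling problem, the following Python sketch resolves a mistyped mention to the closest known name using approximate string matching from the standard library's `difflib` module. This is a stand-in technique chosen for brevity, not the method used in SSIX; the list of known names, the `resolve_name` helper and the 0.8 similarity cutoff are assumptions made for the example.

```python
import difflib

# Hypothetical list of names the system already knows about.
KNOWN_NAMES = ["Microsoft", "Google", "Nokian Tyres"]


def resolve_name(mention, known=KNOWN_NAMES, cutoff=0.8):
    """Return the closest known name for a possibly misspelled mention.

    Comparison is case-insensitive. Returns None when no known name is
    at least `cutoff` similar (the cutoff value is an assumption).
    """
    lowered = {name.lower(): name for name in known}
    matches = difflib.get_close_matches(
        mention.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


print(resolve_name("Microsfot"))
print(resolve_name("Gogle"))
print(resolve_name("Apple"))
```

A fixed cutoff is a blunt instrument: set too low it merges unrelated names, set too high it misses short names with a single typo, so in practice it would need tuning against real data.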
The idea currently being pursued is to run a new adaptive named entity recognition system on the raw social network data. The system should be capable of identifying both known entities and previously unseen names, collecting the latter under a canonical form, assigning each one an appropriate type (e.g. Product, Company, Person etc.), and linking entities through meaningful relations (e.g. Product is_owned_by Company).
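One possible way to picture the output of such a system is a small store of typed entities, their surface variants, and the relations between them. The Python sketch below is a guess at a convenient data structure, not a description of the SSIX design: the `Entity` and `EntityStore` classes, their method names and the example product are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    canonical: str    # canonical form that all variants are collected under
    entity_type: str  # e.g. "Product", "Company", "Person"


@dataclass
class EntityStore:
    # lowercase surface form -> Entity
    variants: dict = field(default_factory=dict)
    # (subject, relation, object) triples
    relations: set = field(default_factory=set)

    def add(self, entity, surface_forms=()):
        """Register an entity under its canonical name and any variants."""
        for form in (entity.canonical, *surface_forms):
            self.variants[form.lower()] = entity
        return entity

    def lookup(self, surface_form):
        """Return the entity for a surface form, or None if it is unseen."""
        return self.variants.get(surface_form.lower())

    def relate(self, subject, relation, obj):
        """Record a typed relation between two entities."""
        self.relations.add((subject, relation, obj))


store = EntityStore()
company = store.add(
    Entity("Nokian Renkaat", "Company"),
    ["Nokian Tyres", "@NokianTyres"],
)
# "Example Winter Tyre" is a made-up product name used only for illustration.
product = store.add(Entity("Example Winter Tyre", "Product"))
store.relate(product, "is_owned_by", company)

print(store.lookup("nokian tyres"))
```

In this sketch a previously unseen name simply returns `None` from `lookup`; an adaptive system would instead have to decide whether to create a new entity for it or attach it to an existing one as a variant.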