July 6, 2016
Sentiment analysis (SA) is a Natural Language Processing (NLP) application which attracts a lot of attention due to the commercial interest and potential around it. Despite the big buzz around sentiment analysis, there is still a major gap in the understanding of the main technical components behind it. This series of posts aims to provide a concise account of the key elements behind the construction of a contemporary sentiment analysis engine.
Taking some time to understand the key components behind sentiment analysis is a fundamental step for anyone making technical decisions about buying or building a sentiment analysis engine. Addressing a sentiment analysis problem goes way beyond the construction of a machine learning classifier and requires an understanding of the many linguistic aspects which are at play.
Defining your Sentiment Analysis Problem
The first step in the construction of any sentiment analysis engine is to define clearly the type of problem which your engine will target. SA comes in different flavors, each of which demands different types of NLP techniques and resources. In fact, opinion mining, a synonym for SA, better reflects the broader nature of the task.
– Classification Type (Simple vs Aspect-based vs Comparative Sentiment Analysis)
Certain sentiment analysis scenarios demand the identification in text of the specific entity (object) and the aspect (attribute or feature) which the sentiment refers to. This requires applying information extraction (IE) techniques to the target text in order to identify these elements, adding a significant layer of complexity to the sentiment analysis process. This type of sentiment analysis, called aspect-based sentiment analysis, is commonly applied to online product reviews, in which target objects (a specific camera product, for example) and their different aspects (e.g. its luminosity capture) are classified according to a specific sentiment category or class.
Other types of sentiment analysis target a general aggregate assessment of the polarity attached to an entity (a politician, artist or a brand), which is identified in a simple fashion. Usually this is the entry point for doing sentiment analysis and defines a coarse-grained type of sentiment analysis.
Another possible variation is the presence of comparative opinions (versus regular opinions), where the output is a comparison between different entities and aspects.
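To make the three flavors concrete, here is a minimal sketch of what the output of each could look like as a data structure. The class names, field names and the "Acme" products are hypothetical and chosen only for illustration; the extraction step that would fill these records from text is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Opinion:
    """A regular opinion. All names here are illustrative assumptions."""
    entity: str                   # the object the sentiment refers to
    polarity: str                 # e.g. "positive", "neutral", "negative"
    aspect: Optional[str] = None  # None means a coarse-grained, entity-level opinion


@dataclass(frozen=True)
class ComparativeOpinion:
    """A comparative opinion: one entity is preferred over another on one aspect."""
    preferred: str
    other: str
    aspect: str


# Simple (coarse-grained): overall polarity attached to a brand.
simple = Opinion(entity="Acme", polarity="positive")

# Aspect-based: polarity attached to one aspect of one product.
aspect_based = Opinion(
    entity="Acme X100 camera", polarity="negative", aspect="luminosity capture"
)

# Comparative: a relation between two entities on a shared aspect.
comparative = ComparativeOpinion(
    preferred="Acme X100 camera", other="Acme X90 camera", aspect="luminosity capture"
)
```

The only difference between the simple and the aspect-based record is the aspect field, but filling that field is what requires the extra information extraction step described above.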
– Polarity Granularity
This dimension covers the types and granularity of the classes which will be the target of the classification task. Typical class schemes range from 3 classes (positive, neutral, negative) up to 5 classes (very positive, positive, neutral, negative, very negative). The meaning of the classes typically varies across domains of discourse (bullish, neutral and bearish for the financial domain, and 5-star ratings for product reviews).
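As a rough illustration of how the class scheme changes the output, the sketch below maps the same numeric score onto a 3-class, a 5-class and a financial scheme. The score range of [-1, 1], the equal-width bins and the function name to_label are assumptions made for the example, not a property of any particular engine.

```python
from typing import Sequence

# Label schemes ordered from most negative to most positive.
THREE_CLASS = ("negative", "neutral", "positive")
FIVE_CLASS = ("very negative", "negative", "neutral", "positive", "very positive")
FINANCIAL = ("bearish", "neutral", "bullish")


def to_label(score: float, labels: Sequence[str]) -> str:
    """Map a score in [-1, 1] onto equal-width bins, one bin per label.

    Hypothetical helper: real systems usually learn or tune the class
    boundaries instead of using equal-width bins.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must be in [-1, 1]")
    # Shift the score to [0, 1], scale it to the number of bins,
    # and clamp so that score == 1.0 falls into the last bin.
    index = min(int((score + 1.0) / 2.0 * len(labels)), len(labels) - 1)
    return labels[index]


mild_score = 0.3
three_way = to_label(mild_score, THREE_CLASS)
five_way = to_label(mild_score, FIVE_CLASS)
financial = to_label(mild_score, FINANCIAL)
```

With these assumed bins, a mildly positive score lands in the neutral class of the 3-class scheme but in the positive class of the 5-class scheme, which is one reason the granularity has to be fixed before results are compared.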
– Discourse Granularity (Opinion Target)
Depending on the domain of discourse, the typical size of the text which needs to be analyzed can vary significantly, ranging from a tweet to a full text (a long product review, for example). Additionally, even for large corpora, it is possible to define different levels of analysis granularity. Typical levels are: document level, sentence level and entity/aspect level.
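The difference between document-level and sentence-level analysis is mostly a question of which units are handed to the classifier. The sketch below shows this with a deliberately naive regular-expression sentence splitter; the function name and the example review are made up for illustration, and a real system would use a proper sentence tokenizer.

```python
import re
from typing import List


def sentence_units(document: str) -> List[str]:
    """Naively split a document after '.', '!' or '?' followed by whitespace.

    Illustrative only: this breaks on abbreviations, decimals, emoticons, etc.
    """
    parts = re.split(r"(?<=[.!?])\s+", document)
    return [part.strip() for part in parts if part.strip()]


review = "The camera is great. The battery dies too quickly! Would I buy it again?"

# Document level: the whole review is one unit and receives one label.
document_level = [review]

# Sentence level: every sentence is a separate unit and receives its own label.
sentence_level = sentence_units(review)
```

At document level the mixed review above has to be squeezed into a single label, while at sentence level the positive and the negative statements can be kept apart. Entity/aspect-level analysis goes one step further and is not covered by this sketch.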
– Subjectivity Level
Depending on the type of analysis which is being aimed at, it can be useful to differentiate between subjective and objective types of sentences. While an objective sentence expresses factual information, a subjective sentence expresses personal opinions, beliefs or feelings. This separation is not always clear, as some opinionated information can be communicated in a more factual style of discourse. For example, a technical comparative analysis between different product attributes will primarily target objective types of discourse, while general brand perception analyses may be more focused on subjective discourse types.
– Discourse Attributes (Formality, Language)
Additionally, the domain of discourse will define the level of language formality (presence of slang and abbreviations), which can impact the quality of sentiment analysis. Another factor to take into account is the set of target languages which will be addressed by the sentiment analysis, which can be mono-lingual (focus on a single language) or multi-lingual (target multiple languages).
Figure 1 summarizes the set of core categories which can be used to define the type of target sentiment analysis and indicates the level of complexity. Understanding these categories is fundamental to properly scoping the problem, as they closely relate to the type and complexity of the NLP strategy which will need to be employed.
Figure 1: Different attributes associated with the sentiment analysis task and their associated complexity.
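One way to use these categories in practice is to write the scope down explicitly before evaluating or building anything. The following is a minimal sketch of such a scoping record; the class and field names are hypothetical and simply mirror the dimensions discussed in this post.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class ClassificationType(Enum):
    SIMPLE = "simple"
    ASPECT_BASED = "aspect-based"
    COMPARATIVE = "comparative"


class DiscourseGranularity(Enum):
    DOCUMENT = "document"
    SENTENCE = "sentence"
    ENTITY_ASPECT = "entity/aspect"


@dataclass(frozen=True)
class SentimentTaskScope:
    """Hypothetical record of the scoping decisions for one SA project."""
    classification_type: ClassificationType
    polarity_classes: Tuple[str, ...]
    discourse_granularity: DiscourseGranularity
    separate_subjective_from_objective: bool
    informal_language: bool        # slang and abbreviations expected in the input
    languages: Tuple[str, ...]     # language codes, e.g. ("en",)

    @property
    def is_multilingual(self) -> bool:
        return len(self.languages) > 1


# Example scope: aspect-based analysis of English product reviews.
product_reviews = SentimentTaskScope(
    classification_type=ClassificationType.ASPECT_BASED,
    polarity_classes=("negative", "neutral", "positive"),
    discourse_granularity=DiscourseGranularity.ENTITY_ASPECT,
    separate_subjective_from_objective=False,
    informal_language=True,
    languages=("en",),
)
```

Filling in such a record for a concrete use case makes it easier to see which of the dimensions above drive the complexity of the NLP strategy.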
In the next posts we will start looking into the state-of-the-art techniques for sentiment analysis.