Sentiment analysis, particularly the automatic classification of written reviews as positive or negative, has been studied extensively over the last decade. By contrast, automatically extracting a speaker's sentiment from audio streams containing spontaneous speech (such as online videos) is a challenging research area that has received little attention in recent years.
Social media websites such as YouTube, where more than 300 hours of video are uploaded every minute, open an interesting opportunity for sentiment analysis that combines textual, audio and speech-based emotion recognition features.
Sentiment analysis in an audio-visual context brings together three research fields:
- visual emotion recognition: since usually only one person appears in each video clip, facing the camera most of the time, current facial tracking technologies can be applied efficiently, such as OMRON's OKAO Vision System, which detects the face in each frame, extracts the facial features and infers some basic facial expressions as well as eye gaze direction;
- audio feature extraction: this makes it possible to convert the voice into text and to extract the emotions of the speaker (satisfied, unhappy, curious, neutral, excited) while speaking in the video;
- text-based analysis and classification: bag-of-words and bag-of-n-grams features can be used for data-based linguistic sentiment classification, applying term frequency, inverse document frequency transformation and document length normalization to the texts.
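The text-based step in the list above can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions, not the feature pipeline used in any particular paper: the function names `ngrams` and `tfidf_vectors`, the whitespace tokenizer, the smoothed IDF formula and the two toy reviews are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_vectors(docs, n_max=2):
    """Map each document to an L2-normalized {n-gram: tf-idf weight} dict."""
    # Bag of n-grams: raw term frequency per document (naive whitespace tokens).
    bags = [Counter(ngrams(doc.lower().split(), n_max)) for doc in docs]
    # Document frequency: number of documents that contain each n-gram.
    df = Counter(term for bag in bags for term in bag)
    n_docs = len(docs)

    vectors = []
    for bag in bags:
        # Smoothed IDF (an assumption of this sketch): log((1 + N) / (1 + df)) + 1
        vec = {
            term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, tf in bag.items()
        }
        # Document length normalization: scale each vector to unit L2 norm.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


# Two made-up reviews, purely for illustration.
reviews = ["great movie great cast", "boring movie"]
vectors = tfidf_vectors(reviews)
```

In this toy corpus the word "movie" occurs in both reviews, so it ends up with a lower weight than "boring", which occurs in only one. The resulting vectors could then be passed to any standard classifier.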
As a tangible example, Wollmer et al. have proposed an innovative technique for sentiment analysis on a video dataset containing 359 YouTube clips in which non-professional people express their opinions on a selection of movies they have previously watched. Experimental results indicate that training on audio and textual features fused with language-independent audio-visual analysis further improves the analysis. In fact, their hybrid cross-corpus n-gram analysis has led to remarkably high F1-measures (the harmonic mean of the model's precision and recall, also known as sensitivity) of up to 71.3%.
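For readers less familiar with the metric, the F1-measure is commonly computed as the harmonic mean of precision and recall. The sketch below shows that calculation for a single class. The function name `f1_score` and the counts in the usage line are made up for illustration and are not taken from the paper.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 for one class from raw counts: harmonic mean of precision and recall."""
    predicted = true_pos + false_pos   # everything the model labeled positive
    actual = true_pos + false_neg      # everything that really is positive
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct positives, 2 false alarms, 4 misses.
example_f1 = f1_score(true_pos=8, false_pos=2, false_neg=4)
```

With these counts precision is 0.8 and recall is about 0.67, which gives an F1 of roughly 0.73.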
Here at 3rdPLACE we find this work fascinating and full of potential, so we have started studying this new concept of audio-visual sentiment analysis in social media.
Martin Wollmer, Felix Weninger, Bjorn Schuller, Tobias Knaup, Congkai Sun, Kenji Sagae, Louis-Philippe Morency: YouTube Movie Reviews: Sentiment Analysis in Audiovisual Context (ResearchGate, May 2013)