Affective computing is an emerging field of research that aims to enable intelligent systems to recognize, feel, infer, and interpret human emotions. It is an interdisciplinary field spanning computer science, psychology, social science, and cognitive science. Although sentiment analysis and emotion recognition are two distinct research topics, they are conjoined under the umbrella of affective computing. Emotions and sentiments play a crucial role in our daily lives.

They aid decision-making, learning, communication, and situation awareness in human-centric environments. Over the past two decades or so, AI researchers have been attempting to endow machines with cognitive capabilities to recognize, interpret, and express emotions and sentiments. All such efforts can be attributed to affective computing. Emotion and sentiment analysis have also become a new trend in social media, helping users understand the opinions expressed across different platforms. With the advancement of technology, the abundance of smartphones, and the rapid rise of social media, a huge amount of data is being uploaded in the form of videos rather than text alone. Consumers, for instance, tend to record their reviews and opinions on products using a web camera and upload them to social media platforms such as YouTube or Facebook to inform subscribers of their views.

The primary advantage of analyzing videos over text alone, for detecting emotions and sentiments in opinions, is the surplus of behavioral cues. Whereas textual analysis relies only on words, phrases, and the relations and dependencies among them, these are known to be insufficient for extracting the associated affective content from textual opinions. Video opinions, on the other hand, provide multimodal data in the form of vocal and visual modalities. The vocal modulations and the facial expressions in the visual data, along with the textual data, provide important cues for better identifying the true affective state of the opinion holder. Thus, combining text and video data can help build a better emotion and sentiment analysis model.

These videos often contain comparisons of products from competing brands, the pros and cons of product specifications, etc., which can aid prospective buyers in making an informed decision. The aim of multi-sensor data fusion is to increase the accuracy and reliability of estimates. Many applications, e.g., navigation tools, have already demonstrated the potential of data fusion. This underlines the importance and feasibility of developing a multimodal framework that can cope with all three sensing modalities, text, audio, and video, in human-centric environments. The ability of a multimodal system to outperform a unimodal one is well established in the literature. However, there is a lack of a comprehensive literature survey focusing on recent successful methods employed in this research area. Unimodal systems are the building blocks of a multimodal system; hence, they must perform well for the resulting multimodal system to be intelligent, as the simple fusion sketch below illustrates.
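To make the fusion idea concrete, the following minimal sketch illustrates decision-level (late) fusion, one common way of combining the outputs of unimodal systems into a multimodal prediction. The per-modality probability scores and the reliability weights are purely illustrative assumptions, not values or methods drawn from any particular system discussed in this survey.

import numpy as np

# Hypothetical sentiment distributions over {negative, neutral, positive}
# produced by three independent unimodal classifiers for one opinion video.
text_probs  = np.array([0.15, 0.25, 0.60])   # textual modality
audio_probs = np.array([0.30, 0.40, 0.30])   # vocal modality
video_probs = np.array([0.10, 0.20, 0.70])   # visual modality

# Decision-level (late) fusion: weight each modality by an assumed
# reliability estimate and combine the predictions.
weights = {"text": 0.40, "audio": 0.25, "video": 0.35}

fused = (weights["text"]  * text_probs +
         weights["audio"] * audio_probs +
         weights["video"] * video_probs)

labels = ["negative", "neutral", "positive"]
print("Fused distribution:", dict(zip(labels, fused.round(3))))
print("Predicted sentiment:", labels[int(np.argmax(fused))])

In this toy example the fused distribution is (0.17, 0.27, 0.56), so the opinion is labeled positive; note that the quality of such a fusion is bounded by the quality of the individual unimodal predictions, which is why strong unimodal systems are a prerequisite for a strong multimodal one.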