The CMU-MOSI dataset is rich in sentimental expressions: 93 people review topics in English. The videos are segmented, and each segment's sentiment label is scored from +3 (strong positive) to -3 (strong negative) by 5 annotators. We took the average of these five annotations as the sentiment polarity and, hence, considered only two classes (positive and negative). The train/validation set consists of the first 62 individuals in the dataset. The test set contains opinionated videos by the remaining 31 speakers. In particular, 1447 and 752 utterances are used for training and testing, respectively.
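The labeling step described above can be sketched in a few lines of Python. This is a minimal sketch, not the original preprocessing code: the function name `binarize_segment` and the sample scores are hypothetical, and the rule that an average of exactly 0 counts as positive is an assumption, since the text does not say how a zero average is handled.

```python
from statistics import mean


def binarize_segment(scores):
    """Average five annotator scores in [-3, +3] and map the result to a binary label."""
    if len(scores) != 5:
        raise ValueError("expected exactly five annotator scores")
    if any(s < -3 or s > 3 for s in scores):
        raise ValueError("scores must lie in [-3, +3]")
    avg = mean(scores)
    # Assumption: an average of exactly 0 is assigned to the positive class.
    label = "positive" if avg >= 0 else "negative"
    return avg, label


if __name__ == "__main__":
    # Hypothetical annotations for one video segment.
    print(binarize_segment([3, 2, 1, 0, -1]))
```

Moving the tie rule to the negative class, or dropping zero-average segments, only requires changing the single comparison.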

Mail to Amir Zadeh(


The Multimodal Opinion Utterances Dataset (MOUD) was developed in 2013. It is a dataset of utterances, with all videos recorded in Spanish. A final set of 80 videos was selected, of which 65 were from female speakers and 15 from male speakers, with ages ranging from 20 to 60 years. A multimodal dataset of 498 utterances was eventually created, with an average duration of 5 seconds and a standard deviation of 1.2 seconds. The dataset was annotated by two annotators using Elan, an annotation tool for video and audio sources. The annotation task led to 182 positive, 231 negative and 85 neutral labeled utterances. In total, 28 features were considered for computation, including prosody features, energy features, voice probabilities and spectral features. This trimodal dataset is reported to yield an error rate reduction of 10.5% compared to the best unimodal set.

Public (Link)


The Institute for Creative Technologies MultiModal Movie Opinion (ICT-MMMO) database was developed in 2013. This dataset is a collection of online videos, obtained from YouTube and ExpoTV, that review movies in English. The authors used keywords such as movie, review, videos and opinions, and the names of recent movies as listed by, as search keywords. The authors collected 308 YouTube videos, of which 228 were annotated as positive, 57 as negative and 23 as neutral. They also gathered 78 movie review videos from ExpoTV, of which 62 were annotated as negative, 14 as neutral and 2 as positive. The final dataset comprised a total of 370 videos: all 308 videos from YouTube and the 62 negative movie review videos from ExpoTV. The annotation task was performed by two annotators for the YouTube videos and one annotator for the ExpoTV videos. In contrast with other datasets, this dataset has five sentiment labels: strongly positive, weakly positive, neutral, weakly negative and strongly negative.
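The composition figures quoted above can be checked with a short Python sketch. The dictionary and function names are hypothetical; only the counts come from the text.

```python
# Three-way annotation counts as quoted in the text.
YOUTUBE = {"positive": 228, "negative": 57, "neutral": 23}
EXPOTV = {"negative": 62, "neutral": 14, "positive": 2}


def final_dataset_size(youtube, expotv):
    """All YouTube videos plus only the negative ExpoTV reviews."""
    return sum(youtube.values()) + expotv["negative"]


if __name__ == "__main__":
    print("YouTube videos:", sum(YOUTUBE.values()))
    print("ExpoTV videos gathered:", sum(EXPOTV.values()))
    print("Final dataset size:", final_dataset_size(YOUTUBE, EXPOTV))
```

Adding only the negative ExpoTV reviews partly offsets the strong positive skew of the YouTube portion.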

Mail to Giota Stratou (

YouTube dataset

This dataset was developed in 2011. The idea behind its development is to capture the data present in the increasing number of videos posted online every day. The authors present it as the first publicly available dataset for tri-modal sentiment analysis, combining visual, audio and textual modalities. The dataset was created by collecting YouTube videos that are diverse and multimodal and contain ambient noise. The keywords used to collect the videos are opinion, review, best perfume, tooth paste, business, war, job, I hate and I like. Finally, a dataset of 47 videos was created, of which 20 were from female speakers and the rest from male speakers, with ages ranging from 14 to 60 years. All speakers expressed their views in English and belonged to different cultures. The videos were converted to .mp4 format with a size of 360x480. Each of the 47 videos was then annotated with one of three sentiment labels: positive, negative or neutral. This annotation task led to 13 positively, 12 negatively and 22 neutrally labeled videos.

Mail to Giota Stratou (

Belfast database

This dataset was developed in 2000. The database consists of audiovisual data of people discussing emotional subjects, taken from TV chat shows and religious programs. It comprises 100 speakers and 239 clips, with 1 neutral and 1 emotional clip for each speaker. Two types of descriptors are provided for each clip: dimensional and categorical. Activation and evaluation are dimensions that are known to discriminate effectively between emotional states. Activation values indicate the dynamics of a state, and evaluation values provide a global indication of the positive or negative feelings associated with the emotional state. Categorical labels describe the emotional content of each state.

On Registration (Link)


This dataset was developed in 2006. It is an audio-visual database developed for use as a reference for testing and evaluating video, audio or joint audio-visual emotion recognition algorithms. The universal emotions of happiness, sadness, surprise, anger, disgust and fear were elicited from 42 speakers of 14 different nationalities.

Public (Link)


This dataset was developed in 2007. It is a large audiovisual database created for building agents that can engage a person in a sustained and emotional conversation using the Sensitive Artificial Listener (SAL) [31] paradigm. SAL is an interaction involving two parties: a 'human' and an 'operator' (either a machine or a person simulating a machine). The interaction is based on two qualities: the first is low sensitivity to preceding verbal context (the words the user uses do not dictate whether the conversation continues), and the second is conduciveness (the operator responds to a phrase in a way that keeps the conversation going). There were 150 participants and 959 conversations, each lasting 5 minutes. There were 6-8 annotators per clip, who eventually traced 5 affective dimensions and 27 associated categories. For the recordings, the participants were asked to talk in turn to four emotionally stereotyped characters: Prudence, who is even-tempered and sensible; Poppy, who is happy and outgoing; Spike, who is angry and confrontational; and Obadiah, who is sad and depressive. Video was recorded at 49.979 frames per second at a spatial resolution of 780 x 580 pixels and 8 bits per sample, while audio was recorded at 48 kHz with 24 bits per sample. To accommodate research in audio-visual fusion, the audio and video signals were synchronized with an accuracy of 25 microseconds.
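To give a sense of scale for the recording parameters quoted above, the following Python sketch works out raw data rates and expresses the synchronization accuracy in audio samples. It is a back-of-the-envelope calculation under stated assumptions: one 8-bit sample per pixel and a single audio channel are assumptions, not facts from the text, and all function names are hypothetical.

```python
# Figures quoted in the text.
FRAME_RATE_HZ = 49.979
WIDTH_PX, HEIGHT_PX = 780, 580
VIDEO_BITS_PER_SAMPLE = 8
AUDIO_RATE_HZ = 48_000
AUDIO_BITS_PER_SAMPLE = 24
SYNC_ACCURACY_S = 25e-6

# Assumptions (not stated in the text): one sample per pixel, one audio channel.
SAMPLES_PER_PIXEL = 1
AUDIO_CHANNELS = 1


def video_bytes_per_second():
    """Raw, uncompressed video data rate in bytes per second."""
    bits = (FRAME_RATE_HZ * WIDTH_PX * HEIGHT_PX
            * SAMPLES_PER_PIXEL * VIDEO_BITS_PER_SAMPLE)
    return bits / 8


def audio_bytes_per_second():
    """Raw, uncompressed audio data rate in bytes per second."""
    return AUDIO_RATE_HZ * AUDIO_CHANNELS * AUDIO_BITS_PER_SAMPLE / 8


def sync_error_in_audio_samples():
    """Synchronization accuracy expressed as a number of audio samples."""
    return SYNC_ACCURACY_S * AUDIO_RATE_HZ


def sync_error_as_frame_fraction():
    """Synchronization accuracy as a fraction of one video frame period."""
    return SYNC_ACCURACY_S * FRAME_RATE_HZ


if __name__ == "__main__":
    print(f"video: {video_bytes_per_second() / 1e6:.1f} MB/s")
    print(f"audio: {audio_bytes_per_second() / 1e3:.1f} kB/s")
    print(f"sync:  {sync_error_in_audio_samples():.2f} audio samples")
    print(f"sync:  {sync_error_as_frame_fraction():.4%} of a frame period")
```

At 48 kHz, the 25 microsecond tolerance corresponds to about 1.2 audio samples, which is a small fraction of a single video frame period.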

On Registration (Link)


Motion Capture Database (IEMOCAP). The IEMOCAP dataset was developed in 2008. 10 actors were asked to record their facial expressions in front of cameras. Facial markers and head and hand gesture trackers were used to capture facial expressions and head and hand gestures. In particular, the dataset contains a total of 10 hours of recordings of dyadic sessions, each of them expressing one of the following emotions: happiness, anger, sadness, frustration and neutral state. The recorded dyadic sessions were later manually segmented at utterance level (an utterance being defined as a continuous segment in which one of the actors was actively speaking). Because the acting was based on scripts, it was easy to segment the dialogs for utterance detection in the textual part of the recordings.

Busso et al. [32] used two well-known emotion taxonomies to manually label the dataset at utterance level: discrete categorical annotations (i.e., labels such as happiness, anger, and sadness) and continuous attribute-based annotations (i.e., activation, valence and dominance). To assess the emotion categories of the recordings, six human evaluators were appointed, and the evaluation sessions were organized so that three different evaluators assessed each utterance. Self-assessment manikins (SAMs) were also employed to evaluate the corpus in terms of the attributes valence [1-negative, 5-positive], activation [1-calm, 5-excited], and dominance [1-weak, 5-strong]. Two more human evaluators were asked to estimate the emotional content of the recordings using the SAM system. Having two different annotation schemes provides complementary information about human emotional expression and emotional communication between people, which can help develop better human-machine interfaces that automatically recognize and synthesize the emotional cues expressed by humans.
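The two annotation schemes described above can be illustrated with a small Python sketch. It is only an illustration under assumptions: the text does not say how the three categorical judgments or the two SAM ratings per utterance are combined, so the majority vote and the per-attribute mean used here are assumed aggregation rules, and the function names and sample values are hypothetical.

```python
from collections import Counter
from statistics import mean

# Emotion categories and SAM attributes named in the text.
CATEGORIES = {"happiness", "anger", "sadness", "frustration", "neutral"}
SAM_ATTRIBUTES = ("valence", "activation", "dominance")


def majority_label(labels):
    """Return the category chosen by at least two of three evaluators, else None.

    Assumption: majority voting is used to combine categorical judgments.
    """
    if len(labels) != 3:
        raise ValueError("expected judgments from exactly three evaluators")
    unknown = set(labels) - CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None


def mean_sam(ratings):
    """Average 1-5 SAM ratings per attribute across evaluators.

    Assumption: a plain mean is used to combine SAM ratings.
    """
    for rating in ratings:
        for attr in SAM_ATTRIBUTES:
            if not 1 <= rating[attr] <= 5:
                raise ValueError(f"{attr} rating must lie in [1, 5]")
    return {attr: mean(r[attr] for r in ratings) for attr in SAM_ATTRIBUTES}


if __name__ == "__main__":
    # Hypothetical judgments for one utterance.
    print(majority_label(["anger", "anger", "frustration"]))
    print(mean_sam([
        {"valence": 2, "activation": 4, "dominance": 3},
        {"valence": 3, "activation": 4, "dominance": 5},
    ]))
```

Returning `None` when all three evaluators disagree keeps ambiguous utterances visible, so they can be filtered out or handled separately.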

On request (Link)



SenticNet is an initiative conceived at the MIT Media Laboratory in 2009 within an industrial Cooperative Awards in Science and Engineering (CASE) research project, born from a collaboration between the Media Lab, the University of Stirling, and Sitekit Solutions Ltd. Since then, SenticNet has been further developed and applied to the design of emotion-aware intelligent applications in fields ranging from data mining to human-computer interaction. The main aim of SenticNet is to make the conceptual and affective information conveyed by natural language (meant for human consumption) more easily accessible to machines. This is done by using the bag-of-concepts model, instead of simply counting word co-occurrence frequencies as in latent semantic indexing, and by leveraging linguistic patterns to allow sentiments to flow from concept to concept based on the dependency relations between clauses.
For more information visit