To train our emotion recognition algorithms, we need speech databases whose recordings are annotated with the speaker's intended emotion and/or with the emotion perceived by a panel of listeners.

We currently use two databases.

The first database we use here is Emo_DB (Burkhardt, Paeschke, Rolfes, Sendlmeier & Weiss 2005). It contains recordings of ten actors, each speaking a number of sentences with neutral semantic content, produced with each of seven intended emotions: happiness, sadness, anger, fear, neutral, disgust and boredom. Recordings for which a panel of listeners did not agree that the perceived emotion matched the intended emotion were excluded from the database.
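As a minimal sketch of how such a database can be indexed for training, the Python fragment below maps each recording to its speaker and intended emotion. It assumes the public Emo_DB distribution, in which a filename such as "03a01Fa.wav" encodes the speaker (first two characters), the sentence, the intended emotion (sixth character, a German initial) and a version letter; the directory name "emodb/wav" is purely illustrative.

```python
# Sketch: mapping Emo_DB recordings to intended-emotion labels.
# Assumed filename convention of the public Emo_DB distribution:
#   W = Wut/anger, L = Langeweile/boredom, E = Ekel/disgust,
#   A = Angst/fear, F = Freude/happiness, T = Trauer/sadness, N = neutral.
from pathlib import Path

EMO_DB_CODES = {
    "W": "anger", "L": "boredom", "E": "disgust", "A": "fear",
    "F": "happiness", "T": "sadness", "N": "neutral",
}

def emo_db_label(wav_path):
    """Return (speaker_id, emotion) for one Emo_DB recording."""
    name = Path(wav_path).stem       # e.g. "03a01Fa"
    speaker = name[:2]               # "03"
    emotion = EMO_DB_CODES[name[5]]  # "F" -> "happiness"
    return speaker, emotion

# Example: collect labels for a directory of Emo_DB wav files.
labels = {p.name: emo_db_label(p) for p in Path("emodb/wav").glob("*.wav")}
```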

The second database is the Surrey Audio-Visual Expressed Emotion (SAVEE) database (Haq, Jackson & Edge 2008). It contains recordings of four actors, each performing 120 utterances distributed over seven intended emotions: happiness, sadness, anger, fear, neutral, disgust and surprise. The recognition rate reported by the database authors was 69% for four emotions (cf. 76% for human listeners), and 56% for all seven emotions (cf. 67% for human listeners).
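A similar sketch for SAVEE is given below. It assumes the public SAVEE distribution, with one directory per actor (DC, JE, JK, KL) and filenames whose letter prefix encodes the intended emotion (e.g. "a03.wav" for anger, "sa12.wav" for sadness); the root directory name "savee" is illustrative.

```python
# Sketch: reading SAVEE labels from its assumed file layout and
# tallying how many utterances each actor contributes per emotion.
from collections import Counter
from pathlib import Path

SAVEE_PREFIXES = {   # assumed prefix-to-emotion mapping
    "a": "anger", "d": "disgust", "f": "fear", "h": "happiness",
    "n": "neutral", "sa": "sadness", "su": "surprise",
}

def savee_label(wav_path):
    """Return (actor, emotion) for one SAVEE recording."""
    p = Path(wav_path)
    prefix = p.stem.rstrip("0123456789")   # "sa12" -> "sa"
    return p.parent.name, SAVEE_PREFIXES[prefix]

counts = Counter(savee_label(p) for p in Path("savee").glob("*/*.wav"))
for (actor, emotion), n in sorted(counts.items()):
    print(actor, emotion, n)
```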