Speech recording studio

One main goal of the Expression team is to develop high-quality voice synthesis. A good speech corpus requires a consistent speech flow (i.e., the actor does not change speaking style during a session) recorded in a consistent and quiet acoustic environment. To broaden our research scope, it is often useful to vary the speech style (dialogs, mood, accent, etc.) as well as the language style. Unfortunately, such corpora are hard to obtain and generally do not meet specific experimental requirements. To address these constraints, speech resources need to be recorded and controlled according to our own protocols.

The team therefore runs its own speech studio, located at ENSSAT in Lannion. The physical studio is complemented by a software platform developed in-house.

The recording studio consists of two rooms: an isolation booth and a control room. The isolation booth can accommodate three people. It is designed to attenuate noise by 50 dB and is equipped with two recording sets. Each recording set consists of a high-quality microphone (Neumann U87AI), a high-quality closed headset (Beyer DT 880, 250 ohms), a monitor and a webcam. The control room is equipped with two audio networks, a video network and a computer network.
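To give a feel for what a 50 dB attenuation means, the short sketch below converts a decibel figure into linear ratios. It assumes the 50 dB figure is a standard sound-level attenuation (20·log10 for pressure, 10·log10 for power); the function name `attenuation_ratios` is purely illustrative.

```python
def attenuation_ratios(db):
    """Return (pressure_ratio, power_ratio) for an attenuation given in dB.

    Assumes the usual definitions: dB = 20*log10(pressure ratio)
    and dB = 10*log10(power ratio).
    """
    pressure_ratio = 10 ** (db / 20)
    power_ratio = 10 ** (db / 10)
    return pressure_ratio, power_ratio


pressure, power = attenuation_ratios(50)
print(f"pressure reduced by a factor of about {pressure:.0f}")
print(f"power reduced by a factor of about {power:.0f}")
```

Under these assumptions, 50 dB corresponds to a sound pressure roughly 316 times lower and an acoustic power 100,000 times lower inside the booth.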

The first audio network is a high-quality digital recording line running from the isolation booth microphones to a digital sound card through a preamplifier (Avalon Design AD2022), an equalizer (Neve 8803 Dual Channel) and finally an analog-to-digital converter (Lynx Aurora 8). The digital sound is edited with a software mixing console (Avid Pro Tools). In addition to the audio signal, electroglottograph (EGG) signals can be captured from the actor. This glottal activity signal is used to derive the F0 (fundamental frequency) trajectory, which is the main indicator of prosody.
The second audio network is used for monitoring and is fully analog. The operator uses it to check the quality of the recorded sound, the consistency of the actor and the accuracy of the transcription. Through this audio line, an actor can receive audio feedback of their own voice, disturbing stimuli (music, other voices, their own delayed voice) or directions from the operator. This network consists of four Neumann KH 120 loudspeakers (two in the booth, two in the control room), a headphone amplifier (ART headamp 6 pro) and an analog mixing console (Yamaha MG206C).
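As an illustration of how an F0 value can be obtained from a quasi-periodic signal such as an EGG recording, here is a minimal sketch based on normalized autocorrelation. This is one common textbook approach, not necessarily the method used by the team; the function name `estimate_f0`, the 60 to 400 Hz search range and the voicing threshold are all assumptions chosen for the example.

```python
import math


def estimate_f0(frame, sample_rate, fmin=60.0, fmax=400.0, voicing_threshold=0.5):
    """Estimate F0 (Hz) of one frame, or return None if no clear period is found.

    The frame is compared with a delayed copy of itself; the delay (lag) that
    gives the highest normalized correlation is taken as the period.
    """
    if not frame:
        return None
    mean = sum(frame) / len(frame)
    x = [s - mean for s in frame]  # remove any DC offset

    # Candidate periods, in samples, for the assumed F0 range.
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(x) - 1)

    best_lag, best_score = None, voicing_threshold
    for lag in range(min_lag, max_lag + 1):
        head = x[:len(x) - lag]
        tail = x[lag:]
        norm = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        if norm == 0.0:
            continue  # silent frame: nothing to correlate
        score = sum(a * b for a, b in zip(head, tail)) / norm
        if score > best_score:
            best_lag, best_score = lag, score

    return None if best_lag is None else sample_rate / best_lag


# Synthetic check: a 40 ms frame of a 120 Hz sine sampled at 16 kHz.
sample_rate = 16000
frame = [math.sin(2 * math.pi * 120 * n / sample_rate) for n in range(640)]
print(estimate_f0(frame, sample_rate))
```

Applying such a function to successive short frames yields an F0 trajectory; frames where it returns None would be treated as unvoiced. The resolution is limited by the integer lag, so the estimate for the synthetic frame is close to, but not exactly, 120 Hz.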

On the software side, recording sessions are orchestrated by a dedicated tool. Its main role is to prompt actors in the isolation booth to utter speech with various indications (mood, intonation, speed, accent, role, …). The prompt is displayed on a simple interface. Sound files are then recorded, segmented and linked to the transcription. The operator controls the whole process in real time and can reject a file (in practice, annotate it) and prompt the actor again with the discarded sentence in case of mispronunciation, poor audio quality, etc. The software is written in C++ and relies on the Windows Audio Session API (WASAPI).
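The prompt, record, review and re-prompt cycle described above can be sketched as follows. This is a simplified illustration in Python, not the team's C++ tool: the names `Prompt`, `Take` and `run_session`, the `record` and `review` callbacks, the immediate retry of a rejected sentence and the `max_attempts` limit are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Prompt:
    sentence_id: str
    text: str
    indications: dict  # e.g. {"mood": "neutral", "speed": "slow"}


@dataclass
class Take:
    prompt: Prompt
    audio_file: str
    accepted: bool
    note: str = ""  # operator annotation, e.g. "mispronunciation"


def run_session(prompts, record, review, max_attempts=3):
    """Run a recording session and return every take, rejected ones included.

    record(prompt, attempt) -> path of the recorded audio file
    review(prompt, audio_file) -> (accepted, note), the operator's verdict
    """
    queue = deque((p, 1) for p in prompts)
    takes = []
    while queue:
        prompt, attempt = queue.popleft()
        audio_file = record(prompt, attempt)
        accepted, note = review(prompt, audio_file)
        # A rejected file is kept and annotated rather than deleted.
        takes.append(Take(prompt, audio_file, accepted, note))
        if not accepted and attempt < max_attempts:
            # Assumption: the actor is re-prompted right away.
            queue.appendleft((prompt, attempt + 1))
    return takes
```

In this sketch each take stays linked to its prompt, so the transcription and the indications remain attached to the audio file whether the operator accepted it or not.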

Here are some pictures of the studio.
