ISDrama

Immersive Spatial Drama Generation through Multimodal Prompting

Anonymous Authors

Abstract. Multimodal immersive spatial drama generation aims to create continuous multi-speaker binaural speech with dramatic prosody from multimodal prompts, with potential applications in AR, VR, and other domains. The task requires modeling spatial information and dramatic prosody simultaneously from multimodal inputs, and collecting data for it is costly. To the best of our knowledge, our work is the first attempt to address these challenges. We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audio, scripts, videos, geometric poses, and textual prompts. We then propose ISDrama, the first immersive spatial drama generation model driven by multimodal prompting. ISDrama comprises two primary components: 1) the Multimodal Pose Encoder, which uses contrastive learning to extract unified pose information from multimodal prompts and accounts for the Doppler effect caused by moving speakers; 2) the Immersive Drama Transformer, a flow-based Mamba-Transformer model that generates high-quality drama and incorporates Drama-MOE to select suitable experts for enhanced prosody and pose control. We also design a context-consistent classifier-free guidance strategy to generate complete dramas coherently. Experimental results show that ISDrama outperforms baseline models on objective and subjective metrics.

Model Overview

In this paper, we first introduce MRSDrama, the first multimodal recorded spatial drama dataset, comprising binaural drama audios, scripts, videos, geometric poses, and textual prompts. The dataset includes 97.82 hours of speech data recorded by 21 speakers across three scenes. Next, we propose ISDrama, the first immersive spatial drama generation model based on multimodal prompting. ISDrama generates high-quality, continuous, multi-speaker binaural speech with dramatic prosody and spatial immersion, driven by multimodal prompts. To extract a unified pose representation from multimodal prompts, we design the Multimodal Pose Encoder, a contrastive learning-based framework that encodes not only position and head orientation but also radial velocity, accounting for the Doppler effect caused by moving speakers. Meanwhile, we develop the Immersive Drama Transformer, a flow-based Mamba-Transformer model capable of generating immersive spatial drama effectively and stably. Within this model, we introduce Drama-MOE (Mixture of Experts), which selects the appropriate experts to enhance prosodic expressiveness and improve pose control. Then, we adopt a context-consistent classifier-free guidance (CFG) strategy to ensure the quality and coherence of complete drama generation.
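To make the role of radial velocity more concrete, the following is a minimal sketch of how a per-frame radial velocity and the resulting Doppler frequency ratio could be derived from a sequence of speaker positions. It is not the Multimodal Pose Encoder itself. The function names, the listener-at-origin convention, the finite-difference scheme, and the constant speed of sound are all assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, air at roughly 20 °C (assumed constant)


def radial_velocities(positions, frame_rate):
    """Per-frame radial velocity (m/s) of a speaker relative to a listener.

    Assumptions (not from the source): the listener sits at the origin,
    `positions` is a list of (x, y, z) tuples in metres with one entry per
    frame, and `frame_rate` is in frames per second.
    Positive values mean the speaker is moving away from the listener.
    """
    distances = [math.sqrt(x * x + y * y + z * z) for x, y, z in positions]
    # Finite difference of the speaker-listener distance over time.
    diffs = [(b - a) * frame_rate for a, b in zip(distances, distances[1:])]
    if not diffs:
        # Zero or one frame: no motion can be estimated.
        return [0.0] * len(distances)
    # Reuse the first difference so the output has one value per frame.
    return [diffs[0]] + diffs


def doppler_factor(radial_velocity, c=SPEED_OF_SOUND):
    """Observed / emitted frequency for a moving source and a static listener."""
    return c / (c + radial_velocity)
```

For a speaker approaching the listener the radial velocity is negative, so `doppler_factor` is slightly above 1 and the perceived pitch rises; for a receding speaker it drops below 1.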

🎧🎧🎧 Please use earphones to listen to the generated audio samples. 🎧🎧🎧

🎙️🎙️🎙️ For fair comparison, all samples are resampled to 48 kHz. 🎙️🎙️🎙️

Silent Video Generation Results

Geometric Pose Generation Results

Textual Prompt Generation Results

Silent Video Generation Results

In this section, we present samples of continuous multi-speaker binaural speech with dramatic prosody, generated from silent video input.
The inputs are the drama script as content, prompt audios that specify the timbre of each speaker, and the silent video (with camera direction) as the pose prompt and scene information. From these, ISDrama generates the immersive spatial drama.

Generated Binaural Audio with Silent Video

Demo2: Waiting for Godot

Generated Binaural Audio with Silent Video

Demo3: Troilus and Cressida

Generated Binaural Audio with Silent Video

Geometric Pose Generation Results

In this section, we present samples of single-speaker binaural speech with dramatic prosody, which allows a more direct comparison with baseline models.
The inputs are the drama script as content, a prompt audio that specifies the timbre, the geometric pose (3D position and quaternion orientation) as the pose prompt, and scene information. From these, ISDrama generates the binaural audio.
The geometric pose is supplied to the model as frame-level sequences; the videos shown here only help visualize it.
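As an illustration of what one frame of such a pose sequence encodes, the sketch below converts a 3D position and a quaternion orientation into quantities that are easier to relate to what one hears: azimuth, distance, and how far the speaker is turned away from the listener. The conventions are assumptions made for this example, not the dataset's documented format: the listener is at the origin facing +x with +y to the left, the quaternion is a unit quaternion in (w, x, y, z) order describing the speaker's head orientation, and the speaker's local forward axis is +x.

```python
import math


def rotate(q, v):
    """Rotate vector v by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    vx, vy, vz = v
    # t = 2 * (q_vec x v)
    tx = 2.0 * (y * vz - z * vy)
    ty = 2.0 * (z * vx - x * vz)
    tz = 2.0 * (x * vy - y * vx)
    # v' = v + w * t + q_vec x t
    return (
        vx + w * tx + (y * tz - z * ty),
        vy + w * ty + (z * tx - x * tz),
        vz + w * tz + (x * ty - y * tx),
    )


def describe_pose(position, orientation):
    """Return (azimuth_deg, distance_m, facing_deg) for one pose frame.

    azimuth_deg: horizontal angle of the speaker, positive to the listener's left.
    facing_deg: angle between the speaker's forward direction and the direction
    from the speaker to the listener (0 means speaking straight at the listener).
    """
    x, y, z = position
    distance = math.sqrt(x * x + y * y + z * z)
    if distance == 0.0:
        raise ValueError("speaker position coincides with the listener")
    azimuth = math.degrees(math.atan2(y, x))
    forward = rotate(orientation, (1.0, 0.0, 0.0))
    to_listener = (-x / distance, -y / distance, -z / distance)
    cos_facing = sum(f * t for f, t in zip(forward, to_listener))
    cos_facing = max(-1.0, min(1.0, cos_facing))  # guard against rounding
    return azimuth, distance, math.degrees(math.acos(cos_facing))
```

Applying `describe_pose` to every frame of a pose sequence gives a trajectory that can be compared against the perceived motion in the generated binaural audio, for example a left-to-right sweep showing up as an azimuth that moves from positive to negative values.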

Demo1: Self-accusation

Audio Prompt Input

Audio Prompt:

Script Input

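我寻找过机会，我没有利用机会。我没有顺应必然，我没有预料到偶然。我没有从反面典型中得到教训，我没有从历史中得到教训。
(English: I looked for opportunities; I did not make use of them. I did not yield to the inevitable; I did not foresee the accidental. I did not learn from bad examples; I did not learn from history.)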

ISDrama(ours)	CosyVoice	FireRedTTS	F5-TTS

Voice Quality: ISDrama learns the timbre and prosody of the audio prompt and conveys a regretful emotion that matches the script content.

Spatialization: Compared with the baselines, our model renders the left-to-right motion more smoothly and naturally, giving a clearer perception of the speaker's pose.

Demo2: Measure for Measure

Audio Prompt Input

Audio Prompt:

Script Input

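你就是闪缎，上好闪缎，真称得起是光溜溜的。我宁可作英国粗纱的花边，也不愿意像你这样头发掉得精光，冒充法国闪缎。这话说得够味儿吧。
(English: You are the satin, fine satin, truly as smooth as can be. I would rather be the trimming on English coarse cloth than lose all my hair as you have and pass for French satin. Well said, wasn't it?)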

ISDrama(ours)	CosyVoice	FireRedTTS	F5-TTS

Voice Quality: ISDrama transfers the timbre of the audio prompt to the synthesized audio and generates smooth, natural sentences with a hint of sarcastic, teasing emotion.

Spatialization: Our model reproduces the natural change in sound as the speaker moves from back to front, giving listeners a more realistic perception of the speaker's pose.

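我久久地沉默不语，惊恐于这场灾难，因为这灾难重大得使人难以说话或者细询问，但凡人对各种不幸都得忍受。
(English: I have long kept silent, terrified by this disaster, for the calamity is so great that one can hardly speak of it or ask about it in detail; yet mortals must endure every misfortune.)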

ISDrama(ours)	CosyVoice	FireRedTTS	F5-TTS

Voice Quality: ISDrama transfers the timbre and pronunciation of the audio prompt and conveys a deep, sorrowful emotion that matches the script content.

Spatialization: Our model captures the spatial information of the speaker pacing from right to left, yielding natural and realistic spatial positioning.

Demo4: Waiting for Godot

ISDrama(ours)	CosyVoice	FireRedTTS	F5-TTS

Voice Quality: ISDrama transfers the timbre and pronunciation of the audio prompt and generates highly expressive speech with an anxious emotion that matches the script content.

Spatialization: Our model faithfully reflects the speaker's rapid movement from left to right, and the synthesized audio gives a smoother and more natural spatial perception.

Textual Prompt Generation Results

In this section, we present samples of continuous multi-speaker binaural speech with dramatic prosody, generated from textual prompts.
The inputs are the drama script as content, prompt audios that specify the timbre of each speaker, a textual prompt for each actor's line as the pose prompt, and scene information. From these, ISDrama generates the immersive spatial drama.

Demo1: Offending the Audience

Prompt Audio Input

Speaker1: