NVIDIA Releases PersonaPlex-7B-v1: A Real-Time Speech-to-Speech Model Designed for Natural and Full-Duplex Conversations

By Asif Razzaq
Publication Date: 2026-01-18 06:48:00

NVIDIA researchers released PersonaPlex-7B-v1, a full-duplex speech-to-speech conversational model that targets natural voice interactions with precise persona control.

From an ASR→LLM→TTS cascade to a single full-duplex model

Conventional voice assistants usually run a cascade: Automatic Speech Recognition (ASR) converts speech to text, a language model generates a text answer, and Text-to-Speech (TTS) converts that answer back to audio. Each stage adds latency, and the pipeline cannot handle overlapping speech, natural interruptions, or dense backchannels.
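To make the latency argument concrete, here is a minimal Python sketch of such a cascade. The class and method names are hypothetical placeholders, not a real product API; the point is structural: each stage blocks on the previous one, and the assistant cannot listen while any stage runs.

```python
# Minimal sketch of a conventional cascaded voice assistant.
# All names here (asr, llm, tts and their methods) are hypothetical
# placeholders, not a real NVIDIA or PersonaPlex API.

class CascadedAssistant:
    def __init__(self, asr, llm, tts):
        self.asr = asr    # speech -> text
        self.llm = llm    # text -> text
        self.tts = tts    # text -> speech

    def respond(self, user_audio: bytes) -> bytes:
        # Each stage must finish before the next one starts,
        # so the latencies of all three stages add up.
        transcript = self.asr.transcribe(user_audio)
        answer_text = self.llm.generate(transcript)
        answer_audio = self.tts.synthesize(answer_text)
        # While this runs, the assistant is effectively deaf:
        # it cannot react to interruptions, overlaps, or
        # backchannels from the user.
        return answer_audio
```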

PersonaPlex replaces this stack with a single Transformer that performs streaming speech understanding and speech generation in one network. The model operates on continuous audio encoded with a neural codec and autoregressively predicts both text tokens and audio tokens. Incoming user audio is encoded incrementally while PersonaPlex simultaneously generates its own speech, which enables barge-in, overlaps, rapid turn-taking, and contextual backchannels.
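The loop below is an illustrative sketch of one decoding step in such a model, assuming a Moshi-style interface in which a neural codec maps audio frames to discrete tokens and a single Transformer advances both streams. All names are assumptions for illustration, not the actual PersonaPlex API.

```python
import torch

# Hypothetical sketch of one full-duplex decoding step.
# `codec` maps audio frames to discrete tokens and back;
# `model` is an autoregressive Transformer over interleaved
# text and audio tokens with a shared recurrent state.

@torch.no_grad()
def step(model, codec, state, user_frame):
    # Incrementally encode the newest chunk of user audio.
    user_tokens = codec.encode(user_frame)        # e.g. (1, n_codebooks)

    # Condition the shared model state on the user's audio stream.
    state = model.ingest_user(state, user_tokens)

    # Predict the agent's next text token and audio tokens for the
    # same time step; text and audio are generated jointly.
    text_logits, audio_logits = model.agent_head(state)
    text_token = torch.argmax(text_logits, dim=-1)
    audio_tokens = torch.argmax(audio_logits, dim=-1)

    # Decode the agent's audio tokens so playback starts frame by
    # frame while the user may still be talking.
    agent_frame = codec.decode(audio_tokens)
    state = model.ingest_agent(state, text_token, audio_tokens)
    return state, text_token, agent_frame
```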

PersonaPlex runs in a dual-stream configuration: one stream tracks user audio, the other tracks the agent's speech and text. Both streams share the same model state, so the agent keeps listening while it speaks and can adjust its response when the user interrupts. The design is directly inspired by Kyutai's Moshi full-duplex framework.
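A hypothetical event loop for this dual-stream design might look like the following, where `mic`, `speaker`, and the per-frame `step` function sketched above are assumptions for illustration.

```python
# Hypothetical event loop for the dual-stream design described above.
# `mic` yields fixed-size user audio frames, `speaker` plays agent
# frames, and `step` is the decoding function sketched earlier.

def converse(model, codec, mic, speaker, max_frames=10_000):
    state = model.initial_state()
    for _, user_frame in zip(range(max_frames), mic):
        # One tick advances BOTH streams: the user frame is ingested
        # and an agent frame is produced from the same shared state.
        state, text_token, agent_frame = step(model, codec, state, user_frame)

        # Because listening never pauses, a user interruption shows up
        # in `state` immediately and the next agent frames adapt to it;
        # there is no turn-taking controller outside the model.
        speaker.play(agent_frame)
```

The key design choice this illustrates is that barge-in handling is not a bolt-on voice-activity detector; it falls out of the fact that both streams update one shared model state at every frame.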

Hybrid prompting, voice control, and role control

PersonaPlex uses two prompts to define the conversational…