ProAct: A Dual-System Framework for Proactive Embodied Social Agents


ProAct is a proactive, real-time framework for embodied agents that generates coherent verbal and nonverbal behaviors by unifying low-latency behavioral control with long-horizon social reasoning.

Introduction

Embodied social agents have recently advanced in generating synchronized speech and gestures. However, most interactive systems remain fundamentally reactive, responding only to current sensory inputs within a short temporal window. Proactive social behavior, in contrast, requires deliberation over accumulated context and intent inference, which conflicts with the strict latency budget of real-time interaction.

We present ProAct, a dual-system framework that reconciles this time-scale conflict by decoupling a low-latency Behavioral System for streaming multimodal interaction from a slower Cognitive System that performs long-horizon social reasoning and produces high-level proactive intentions. To translate deliberative intentions into continuous non-verbal behaviors without disrupting fluency, we introduce a streaming flow-matching model conditioned on intentions via ControlNet. This mechanism supports asynchronous intention injection, enabling seamless transitions between reactive and proactive gestures within a single motion stream. The system is deployed on a physical humanoid robot.



Dual-System Overview

System Overview

ProAct incorporates two systems to reconcile low-latency reactive control with longer-horizon deliberation: a fast Behavioral System for streaming multimodal interaction, and a Cognitive System for compressed-context reasoning and proactive planning.

The Behavioral System maintains the real-time interaction loop, producing streaming verbal and non-verbal responses through a cascaded architecture that connects a streaming omni-modal LLM with a streaming motion generator. The two channels operate asynchronously: verbal responses follow a turn-based pattern, while motion generation runs continuously to maintain embodied presence throughout the interaction.
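The asynchronous coupling described above can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions: the tick-based loop, the `run_interaction` function, its parameters, and the string labels are hypothetical and are not taken from the ProAct implementation. The point it shows is that the fast loop polls for new intentions without ever blocking, so motion keeps flowing whether or not the slower system has produced anything.

```python
import queue


def run_interaction(num_ticks, planner_period, intentions):
    """Simulate a fast motion loop that polls a slow planner without blocking.

    All names here are illustrative assumptions, not ProAct's API.
    num_ticks:      number of motion windows produced by the fast loop
    planner_period: the slow system finishes one reasoning cycle every
                    `planner_period` ticks
    intentions:     intentions the slow system emits, in order
    """
    inbox = queue.Queue()          # slow system -> fast system
    pending = iter(intentions)
    log = []
    for tick in range(num_ticks):
        # Slow system: occasionally completes a cycle and posts an intention.
        if tick > 0 and tick % planner_period == 0:
            intention = next(pending, None)
            if intention is not None:
                inbox.put(intention)
        # Fast system: never waits; falls back to reactive motion when the
        # inbox is empty.
        try:
            intention = inbox.get_nowait()
            log.append(f"proactive:{intention}")
        except queue.Empty:
            log.append("reactive")
    return log
```

For example, `run_interaction(6, 3, ["wave"])` yields reactive windows everywhere except the tick at which the intention arrives. In a real deployment the two loops would run in separate threads or processes; the single-threaded loop is used here only to keep the behavior reproducible.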

The Cognitive System performs slower, LLM-driven deliberative reasoning over the interaction context, in parallel with the real-time loop: a Context Encoder compresses the accumulated history into a bounded memory, and a Behavior Planner assesses motivation and plans actions. These components run in continuous reasoning cycles, each within a fixed time budget, so that inference stays timely and consistent.
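Two ideas in this paragraph, a memory that stays bounded as history grows and a reasoning cycle that stops at a fixed time budget, can be sketched as follows. The class `BoundedMemory`, the function `plan_within_budget`, and the `summarize` and `steps` callables are hypothetical stand-ins (in ProAct the compression and planning are LLM-driven, which is not reproduced here).

```python
import time
from collections import deque


class BoundedMemory:
    """Keeps at most `max_items` entries by merging the two oldest ones.

    `summarize(a, b)` is an assumed stand-in for an LLM-based Context
    Encoder; here it can be any function that merges two entries.
    """

    def __init__(self, max_items, summarize):
        if max_items < 1:
            raise ValueError("max_items must be at least 1")
        self.max_items = max_items
        self.summarize = summarize
        self.items = deque()

    def add(self, event):
        self.items.append(event)
        while len(self.items) > self.max_items:
            oldest = self.items.popleft()
            second = self.items.popleft()
            self.items.appendleft(self.summarize(oldest, second))


def plan_within_budget(steps, budget_s, clock=time.monotonic):
    """Run reasoning steps in order until the time budget is used up.

    Each step receives the previous partial result and returns a refined
    one, so the best result so far is available at the deadline.
    """
    deadline = clock() + budget_s
    result = None
    for step in steps:
        if clock() >= deadline:
            break
        result = step(result)
    return result
```

Passing the clock as a parameter is a design choice made for this sketch: it lets the budget logic be exercised with a fake clock instead of real waiting.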



Flow-Matching Based Motion Generator

We introduce a real-time motion synthesis framework that uses Conditional Flow Matching (CFM) to transform Gaussian noise into motion via a learned velocity field. The approach combines an optimal-transport probability path with a transformer-based backbone that models dyadic interactions, processing synchronized audio streams to generate both expressive speaker gestures and attentive listener behaviors. To ensure temporal continuity during streaming, an overlap-and-cache scheme carries context across consecutive windows, removing boundary discontinuities between them.
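As a rough illustration of the two mechanisms named above, the sketch below uses NumPy and makes several assumptions that the text does not confirm: the optimal-transport path is taken in its common straight-line form x_t = (1 - t) x_0 + t x_1 with target velocity x_1 - x_0, sampling uses plain Euler integration, and the overlap-and-cache scheme is approximated by feeding the last `overlap` frames of each window back as a prefix for the next. All function names and the `generate_window(prefix, length)` interface are hypothetical.

```python
import numpy as np


def ot_path(x0, x1, t):
    """Point on the straight-line path from noise x0 to data x1, plus the
    constant target velocity along that path (assumed path form)."""
    xt = (1.0 - t) * x0 + t * x1
    return xt, x1 - x0


def cfm_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt, target = ot_path(x0, x1, t)
    return float(np.mean((velocity_fn(xt, t) - target) ** 2))


def euler_sample(velocity_fn, x0, num_steps=8):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


def stream_windows(generate_window, num_windows, window, overlap, dim):
    """Emit a continuous frame stream from overlapping windows.

    `generate_window(prefix, length)` is an assumed interface: it returns
    `length` frames of shape (length, dim) whose first len(prefix) frames
    reproduce the cached prefix. Only the new frames are emitted.
    """
    cache = np.zeros((0, dim))
    emitted = []
    for _ in range(num_windows):
        frames = generate_window(cache, window)
        emitted.append(frames[len(cache):])   # skip frames already played
        cache = frames[-overlap:]             # context for the next window
    return np.concatenate(emitted, axis=0)
```

With a single fixed target x_1, the velocity field (x_1 - x) / (1 - t) matches the target exactly, which makes it a convenient sanity check for both `cfm_loss` and `euler_sample`. A real generator would replace it with the trained transformer conditioned on audio.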

To incorporate high-level semantic intentions without degrading rhythmic synchronization, the model integrates a disentangled ControlNet architecture that decouples text-based control from the frozen, audio-driven base generator. This unified system enables seamless transitions between reactive and proactive behaviors while maintaining a generation speed faster than real-time playback.
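The ControlNet idea referred to here, as introduced in the original ControlNet design, attaches a trainable branch to a frozen base network through zero-initialized projections, so that the combined model initially behaves exactly like the base. The toy NumPy sketch below illustrates only that property. The class `ControlledGenerator`, its single-layer "networks", and the way intention features are mixed in are assumptions for illustration and do not describe ProAct's actual architecture.

```python
import numpy as np


class ControlledGenerator:
    """Toy frozen base plus a zero-initialized control branch (illustrative)."""

    def __init__(self, dim, rng):
        # Stand-in for the frozen, audio-driven base generator.
        self.base_w = rng.normal(size=(dim, dim))
        # Trainable copy of the base weights for the control branch.
        self.ctrl_w = self.base_w.copy()
        # Zero-initialized projection: the branch contributes nothing
        # until training moves these weights away from zero.
        self.zero_proj = np.zeros((dim, dim))

    def forward(self, audio_feat, intent_feat=None):
        base_out = np.tanh(audio_feat @ self.base_w)
        if intent_feat is None:
            # No intention available: purely reactive, audio-driven output.
            return base_out
        ctrl_out = np.tanh((audio_feat + intent_feat) @ self.ctrl_w)
        return base_out + ctrl_out @ self.zero_proj
```

Because the intention input is optional, the same forward pass serves both cases: without an intention it reduces to the base generator, and with one it adds a residual on top of the unchanged audio-driven output.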

Sample Results on Specific Tasks

Ablation on Cognitive System and Comparison of Different User Emotions

Ablation of Motion Generation Methods

Case Example Breakdown

The cases below show ProAct's intermediate outputs on specific tasks to illustrate how the dual-system architecture produces proactive behaviors. Case 3 additionally compares the full system with a variant that lacks the Cognitive System, highlighting how the dual-system design enables timely, context-aware proactive behaviors.

Case 1

Case 2

Case 3