The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning

¹KwangWoon University, Republic of Korea    ²ETRI, Republic of Korea

🎉 Accepted to ICRA 2026! 🤖✨

IEEE International Conference on Robotics and Automation, Vienna 🇦🇹

[Video]

Abstract

Conventional robot social behavior generation has been limited in flexibility and autonomy, relying on predefined motions or human feedback. This study proposes CRISP (Critique-and-Replan for Interactive Social Presence), an autonomous framework where a robot critiques and replans its own actions by leveraging a Vision-Language Model (VLM) as a "human-like social critic." CRISP integrates (1) extraction of movable joints and constraints by analyzing the robot's description file (e.g., MJCF), (2) generation of step-by-step behavior plans based on situational context, (3) generation of low-level joint control code by referencing visual information (joint range-of-motion visualizations), (4) VLM-based evaluation of social appropriateness and naturalness, including pinpointing erroneous steps, and (5) iterative refinement of behaviors through reward-based search. This approach is not tied to a specific robot API; it can generate subtly different, human-like motions on various platforms using only the robot's structure file. In a user study involving five different robot types and 20 scenarios, including mobile manipulators and humanoids, our proposed method achieved significantly higher preference and situational appropriateness ratings compared to previous methods. This research presents a general framework that minimizes human intervention while expanding the robot's autonomous interaction capabilities and cross-platform applicability.

Key Contributions

🧠

VLM-based Autonomous Self-Refinement

We introduce an autonomous "generate-evaluate-regenerate" cycle where a VLM acts as a "social critic". This allows robots to self-assess and refine behavior without explicit human feedback.

⚙️

Flexible Low-level Motion Generation

By generating joint-level control directly from MJCF structure files, we bypass rigid, platform-specific APIs. This enables natural, subtly varied motions that are not tied to any single platform.
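The paper does not publish its implementation, but the first step of this idea, reading the movable joints and their limits out of an MJCF description, can be sketched with a plain XML parser. The MJCF snippet and the function name below are illustrative assumptions, not the authors' code:

```python
import xml.etree.ElementTree as ET

# Hypothetical two-joint arm in MJCF, for illustration only.
MJCF = """
<mujoco model="toy_arm">
  <worldbody>
    <body name="shoulder">
      <joint name="shoulder_pitch" type="hinge" range="-1.57 1.57"/>
      <body name="elbow">
        <joint name="elbow_flex" type="hinge" range="0 2.6"/>
      </body>
    </body>
  </worldbody>
</mujoco>
"""

def extract_joints(mjcf_xml: str) -> dict:
    """Return {joint_name: (lo, hi)} for every joint that declares a range."""
    root = ET.fromstring(mjcf_xml)
    joints = {}
    for joint in root.iter("joint"):
        name = joint.get("name")
        rng = joint.get("range")
        if name and rng:
            lo, hi = map(float, rng.split())
            joints[name] = (lo, hi)
    return joints

print(extract_joints(MJCF))
# {'shoulder_pitch': (-1.57, 1.57), 'elbow_flex': (0.0, 2.6)}
```

A table like this gives the planner everything it needs to write joint commands that respect each platform's limits, without any robot-specific API.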

🤖

Broad Cross-Platform Applicability

CRISP was validated across 20 scenarios on five robot platforms, including humanoids and mobile manipulators. User studies confirmed significant improvements in social appropriateness.


CRISP Framework Overview


CRISP is an autonomous framework that leverages a Vision-Language Model (VLM) as a "social critic" to evaluate and refine robot behaviors. It enables robots to generate naturally social and platform-agnostic motions through an iterative self-refinement loop.

Evolution of Social Behavior

How CRISP refines a robot's motion from an awkward initial guess to a natural social gesture.


Fig 8. Evolution of a plan through replanning. The boxed steps indicate modified actions with VLM-assigned reward scores.

CRISP optimizes robot behavior through the Reward-based Adaptive Search (RAS) algorithm. The process begins with an initial behavior plan generated by the LLM. The Motion Evaluator then captures keyframes of the execution and leverages a VLM to assign a social appropriateness reward between 1 and 10.

  • 🎯 Reward-based Search: Based on the VLM's score, the system adaptively adjusts the search width—broadening the search for low rewards and fine-tuning for high rewards. The algorithm terminates once the behavior achieves a high reward (R ≥ 8).
  • ✨ Iterative Refinement (Fig 8): As shown in the example, the initial plan (Row 1) undergoes revisions where specific erroneous steps (e.g., Steps 2 and 5) are pinpointed and modified. Each iteration increases the motion's naturalness and context-appropriateness until it reaches a successful state.
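The loop described above can be sketched as follows. This is a minimal illustration, not the released algorithm: the LLM replanner and the VLM critic are replaced by deterministic toy stand-ins, and the concrete width values (5 and 2) are assumptions for the sketch; only the 1-10 reward scale and the R ≥ 8 termination rule come from the paper.

```python
R_SUCCESS = 8   # termination threshold from the paper (R >= 8)
MAX_ITERS = 10  # iteration cap (illustrative assumption)

def generate_candidates(plan, width):
    """Stand-in for LLM replanning: propose `width` revised variants."""
    return [f"{plan}+rev{i}" for i in range(width)]

def vlm_reward(plan):
    """Stand-in for the VLM 'social critic' scoring keyframes on 1-10."""
    return (sum(map(ord, plan)) % 10) + 1

def reward_adaptive_search(initial_plan):
    plan, reward = initial_plan, vlm_reward(initial_plan)
    for _ in range(MAX_ITERS):
        if reward >= R_SUCCESS:
            break
        # Low reward -> broaden the search; high reward -> fine-tune narrowly.
        width = 5 if reward <= 4 else 2
        scored = [(vlm_reward(c), c) for c in generate_candidates(plan, width)]
        best_reward, best_plan = max(scored)
        if best_reward > reward:
            reward, plan = best_reward, best_plan
    return plan, reward
```

In the real system, `vlm_reward` would render keyframes of the executed motion and query the VLM, and `generate_candidates` would ask the LLM to revise only the erroneous steps the critic pinpointed.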

Experimental Results

Quantitative evaluation across 5 robot platforms and 20 interaction scenarios.

1. Overall Performance

Overall Results

Our human-subject study (N=50) demonstrated that CRISP (4.5 ± 1.11) significantly outperformed the baseline GenEM (3.4 ± 1.13) and the non-replanning version (3.79 ± 1.16).

*Statistically significant differences were confirmed for all pairs (p < .001) using Wilcoxon signed-rank tests with Holm-Bonferroni correction.


2. Detailed Scenario-wise Analysis

Performance consistency across 4 distinct scenario categories:

Scenario 1: Simple Greetings
Scenario 2: Emotional Responses
Scenario 3: Specific Commands
Scenario 4: Complex Emotions
Scenario-wise Results

The CRISP framework consistently achieved superior scores across all platforms. The advantage is particularly evident in high-DOF robots like the Unitree G1 humanoid, where autonomous visual feedback is essential for refining complex social postures.

BibTeX

@inproceedings{lim2026robot,
  title={The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning},
  author={Lim, Jiyu and Yoon, Youngwoo and Park, Kwanghyun},
  booktitle={Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026},
  eprint={2603.20164},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  note={To appear}
}