What if non-expert operators could adapt an industrial robot’s skills as easily as touching, talking to, or clicking on it? Our framework MOMO (Motion Modulation) unifies three complementary interaction modalities for seamless robot skill learning and adaptation.
MOMO integrates five components around a central motion modulation module:
- Kinesthetic touch for precise spatial corrections, using energy-tank-based human intention detection that automatically inserts via-points into the underlying motion model (see the energy-tank sketch after this list).
- Natural language for high-level semantic modifications through a tool-based LLM architecture that selects and parameterizes pre-validated functions and never generates executable code (see the tool-dispatch sketch after this list).
- A graphical web interface (Vue.js/Three.js) for visualizing geometric relations, inspecting parameters, and editing via-points by drag-and-drop on a real-time digital twin.
- Probabilistic Virtual Fixtures for guided demonstration recording (see the stiffness-modulation sketch after this list).
- Ergodic control for surface finishing tasks.
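
The energy-tank idea behind the kinesthetic channel fits in a few lines: the energy the operator injects at the end effector fills a tank, and a sustained push past a trigger level is read as intent to correct the skill. A minimal sketch, with an illustrative tank update; the class name, constants, and leak term are ours, not MOMO's:

```python
import numpy as np

class EnergyTankIntentionDetector:
    """Illustrative energy-tank detector: a sustained human push at the
    end effector fills the tank past a trigger level, signaling intent."""

    def __init__(self, e_init=2.0, e_trigger=2.5, e_max=4.0, leak=0.1):
        self.energy = e_init        # current tank level [J]
        self.e_trigger = e_trigger  # level that signals human intent [J]
        self.e_max = e_max          # saturation keeps the scheme passive
        self.leak = leak            # slow decay back to rest [W]

    def step(self, f_ext, xdot, dt):
        """f_ext: external force estimate [N], xdot: EE velocity [m/s]."""
        power_in = max(float(f_ext @ xdot), 0.0)  # power injected by the human
        self.energy = min(self.energy + power_in * dt, self.e_max)
        self.energy = max(self.energy - self.leak * dt, 0.0)
        return self.energy > self.e_trigger

detector = EnergyTankIntentionDetector()
via_points = []
# inside the control loop (dummy values; x is the current EE position):
f_ext, xdot, x, dt = 10.0 * np.ones(3), 0.05 * np.ones(3), np.zeros(3), 1e-3
if detector.step(f_ext, xdot, dt):
    via_points.append(x.copy())  # insert the new via-point into the motion model
```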
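
The tool-based pattern keeps the LLM away from code generation entirely: the model only picks a whitelisted function and fills in JSON arguments, which are validated before execution. A minimal sketch of that pattern; the function name, schema, and bounds are hypothetical, not MOMO's actual tool set:

```python
import json

def shift_via_point(index: int, offset_z_m: float) -> str:
    """Pre-validated skill modification; bounds-checked before execution."""
    if not -0.2 <= offset_z_m <= 0.2:
        return "rejected: offset outside the allowed +/-0.2 m range"
    # ... apply the offset to via-point `index` in the motion model ...
    return f"via-point {index} shifted by {offset_z_m} m in z"

TOOLS = {"shift_via_point": shift_via_point}  # the whitelist

TOOL_SCHEMA = [{
    "type": "function",
    "function": {
        "name": "shift_via_point",
        "description": "Move one via-point of the current skill along z.",
        "parameters": {
            "type": "object",
            "properties": {
                "index": {"type": "integer"},
                "offset_z_m": {"type": "number"},
            },
            "required": ["index", "offset_z_m"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Execute only whitelisted functions with JSON-parsed arguments."""
    fn = TOOLS[tool_call["name"]]  # KeyError -> unknown tool, nothing runs
    return fn(**json.loads(tool_call["arguments"]))

# e.g. the model answers with a tool call instead of code:
print(dispatch({"name": "shift_via_point",
                "arguments": '{"index": 2, "offset_z_m": 0.05}'}))
```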
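
Probabilistic Virtual Fixtures typically modulate guidance stiffness by the variance of prior demonstrations, so the operator is guided firmly where demonstrations agree and left free where they diverge. A minimal sketch of that idea, assuming mean and variance predictions from, e.g., Gaussian mixture regression; the gains and scaling law are illustrative, not MOMO's exact formulation:

```python
import numpy as np

def fixture_force(x, mu, sigma2, k_max=500.0, eps=1e-3):
    """Illustrative probabilistic virtual fixture: spring toward the
    reference mean `mu`, with stiffness scaled down where the predicted
    variance `sigma2` says the demonstrations disagree."""
    k = k_max / (1.0 + sigma2 / eps)  # stiff only where demos are consistent
    return -k * (x - mu)              # guidance force on the operator's hand

# during guided recording, mu/sigma2 come from regression over prior demos:
x = np.array([0.30, 0.00, 0.52])   # current end-effector position [m]
mu = np.array([0.30, 0.02, 0.50])  # predicted reference point [m]
sigma2 = 1e-3                      # predicted positional variance [m^2]
print(fixture_force(x, mu, sigma2))
```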
A key result: the tool-based LLM architecture generalizes beyond Kernelized Movement Primitive (KMP) skills to ergodic control, so the same chat interface can drive voice-commanded surface finishing; a hypothetical tool in that style is sketched below. Users switch freely between modalities: voice for obstacle avoidance, kinesthetic touch for fine corrections, and the graphical interface for verification.
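
To illustrate the generalization claim, here is a hypothetical ergodic-control tool in the same style as the dispatch sketch above: one more validated function in the whitelist is all the chat or voice interface needs to command surface finishing. Names and bounds are ours, not MOMO's:

```python
def start_surface_finishing(center_xy: tuple, radius_m: float) -> str:
    """Hypothetical pre-validated tool: build a coverage target density
    over the requested region and hand it to the ergodic controller."""
    if not 0.01 <= radius_m <= 0.15:
        return "rejected: radius outside the validated workspace range"
    # ... construct the target distribution, start ergodic coverage ...
    return f"ergodic finishing started: r={radius_m} m around {center_xy}"

# one more entry in the same whitelist the KMP tools live in:
TOOLS = {"start_surface_finishing": start_surface_finishing}
print(TOOLS["start_surface_finishing"]((0.40, 0.10), 0.05))
```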
Validated live on a 7-DoF torque-controlled DLR robot at the Automatica 2025 trade fair, with the LLM backend (Qwen2.5-VL-72B-Instruct) hosted locally for data privacy and low latency.
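
A minimal sketch of how a client might talk to such a backend, assuming the model is exposed through an OpenAI-compatible endpoint (e.g., served with vLLM); the URL, tool schema, and prompt are placeholders, not the deployment's actual configuration:

```python
from openai import OpenAI

# any OpenAI-compatible local server works; host and port are placeholders
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{"type": "function", "function": {
    "name": "shift_via_point",
    "description": "Move one via-point of the current skill along z.",
    "parameters": {"type": "object",
                   "properties": {"index": {"type": "integer"},
                                  "offset_z_m": {"type": "number"}},
                   "required": ["index", "offset_z_m"]}}}]

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{"role": "user", "content": "Lift via-point 2 by 5 cm."}],
    tools=tools,            # whitelist of pre-validated functions
    tool_choice="auto",     # the model picks a tool call, never emits code
)
print(resp.choices[0].message.tool_calls)
```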