Research Intelligence

Live arXiv AI Data Stream

cs.RO, cs.AI

Jun 24, 2026

Learning Action Priors for Cross-embodiment Robot Manipulation

Most Vision-Language-Action (VLA) models build on a Vision-Language Model (VLM) backbone by attaching an action module and optimizing the full policy jointly. This design inherits strong visual and linguistic priors from the VLM, but leaves the action module to learn physical motion almost from scratch. As a result, the policy lacks an explicit motion prior, forcing early optimization to simultaneously discover temporal action dynamics and cross-modal alignment, a challenge further amplified in cross-embodiment settings. In this work, we propose to pretrain the action module with motion priors before cross-modal VLA alignment. Specifically, we introduce a two-stage training framework that equips the action module with cross-embodiment temporal motion structure before VLA training begins. In Stage 1, a lightweight flow-matching-based encoder-decoder action module efficiently learns temporal motion structure solely from unconditioned action trajectories, without processing visual or language tokens. In Stage 2, this learned prior is transferred to VLA training through decoder reuse and early-stage latent distillation, aligning visual-language features with the action embedding space while still allowing end-to-end policy refinement. In addition, the trained encoder serves as a compact history compressor, summarizing state-action histories into a single temporal context token for history-aware modeling at negligible cost. Extensive experiments across 13 diverse cross-embodiment tasks on both simulated and real-world platforms validate the effectiveness of our approach. Compared with VLA training without action priors, our model achieves faster convergence, higher success rates, and substantially stronger performance on data-scarce real-world tasks. Moreover, scaling up the action data in Stage 1 yields a more generalizable action prior that directly improves downstream VLA performance.

Dong Jing, Tianqi Zhang, Jiaqi Liu, +5 more
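
For readers unfamiliar with flow matching, the Stage 1 idea of learning motion structure from action trajectories alone can be illustrated with a generic flow-matching objective. This is a minimal sketch, not the authors' implementation: the abstract does not specify the probability path, network, or chunk shape, so the straight-line (rectified-flow style) path, the function names `flow_matching_pair` and `flow_matching_loss`, and the 16-step, 7-dimensional action chunks are all assumptions made for illustration.

```python
import numpy as np

def flow_matching_pair(actions, rng):
    """Build one flow-matching training example per action chunk.

    actions: array of shape (batch, horizon, action_dim), no vision or language.
    Returns (x_t, t, target_velocity).
    Assumption: straight-line path between noise and data.
    """
    noise = rng.standard_normal(actions.shape)       # x_0 ~ N(0, I)
    t = rng.uniform(size=(actions.shape[0], 1, 1))   # one time value per chunk
    x_t = (1.0 - t) * noise + t * actions            # point on the path at time t
    target_velocity = actions - noise                # derivative of x_t w.r.t. t
    return x_t, t, target_velocity

def flow_matching_loss(velocity_fn, actions, rng):
    """Mean squared error between predicted and target velocity."""
    x_t, t, target = flow_matching_pair(actions, rng)
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))

# Hypothetical data: 4 action chunks, 16 timesteps, 7 action dimensions.
rng = np.random.default_rng(0)
actions = rng.standard_normal((4, 16, 7))

# Placeholder "model" that always predicts zero velocity.
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss = flow_matching_loss(zero_model, actions, rng)
```

In a real setup `velocity_fn` would be a trainable network (in the paper's description, an encoder-decoder action module) and the loss would be minimized by gradient descent; the placeholder here only shows that the objective needs nothing but action trajectories.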

RevengeBench: Reverse Engineering Code-Space Policies from Behavioral Experiments

TryOnCrafter: Unleashing Camera Trajectories for Realistic Video Virtual Try-on via a Renderable 4D Try-on Proxy

On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation

Real-Time Voice AI Hears but Does Not Listen

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

A cross-process welding penetration status prediction algorithm based on unsupervised domain adaptation in laser and TIG welding

Model Forensics: Investigating Whether Concerning Behavior Reflects Misalignment

When Certainty Is an Artifact: Keyword Lexicon Blindness and the (Mis)Measurement of Rhetorical Stance

Deviance-style normalization for jointly overdispersed counts

A welding penetration prediction model for laser welding process based on self-supervised learning using physics-informed neural networks

DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems

When Does Synthetic Data Augmentation Improve Score-Based Imbalanced Classification?

Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining

RoboAtlas: Contextual Active SLAM

How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

AI translation of literary texts is "fine", but readers still prefer human translations

FedReLa: Imbalanced Federated Learning via Re-Labeling

Detect, Unlearn, Restore: Defending Text Summarization Models Against Data Poisoning

TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs

Can Trustless Agents Be Trusted? An Empirical Study of the ERC-8004 Decentralized AI Agent Ecosystem

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

In-Context World Modeling for Robotic Control

Privacy Vulnerabilities of Attention Layers in Tabular Foundation Models and Protection of High-Risk Queries

MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation

The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar

Is Variational Monte Carlo Robust? Sharp Moment Thresholds and Heavy-tailed Stochastic Optimization

From Sparse and Imperfect 2D Anchors to Consistent 3D Gaussian Street Scenes: Support-Aware Appearance

FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation

Dziri Voicebot: An End-to-End Low-Resource Speech-to-Speech Conversational System for Algerian Dialect

Hierarchical Reinforcement Learning for Neural Network Compression (HiReLC): Pruning and Quantization

Variable Bound Tightening for Nash Equilibrium Computation in Multiplayer Imperfect-Information Games

Autodata: An agentic data scientist to create high quality synthetic data

SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models

Taxonomy-aware deep learning for hierarchical marine species classification in underwater imagery

Weave of Formal Thought

The Inference-Compute Frontier and a Latency-Efficient Architecture for Limit Order Book Prediction

InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy

Multi-Agent Goal Recognition with Team- and Goal-Conditioned Reinforcement Learning and Factorized Branch-and-Bound

Tensorion: A Tensor-Aware Generalization of the Muon Optimizer

Helpful or Harmful? Evaluating LLM-Assisted Vulnerability Patching via a Human Study

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors

WinDOM: Self-Family Distillation for Small-Model GUI Grounding

A Benchmark for Heterogeneous Stereo Deblurring with Physically- and Epipolar-constrained Cross Attention

Agentic System as Compressor: Quantifying System Intelligence in Bits