SmolVLA: Open-Source Robotics AI
A compact open-source model running on a MacBook. Community-driven datasets are driving robotics in the real world. A new era of accessible, intelligent machines has arrived.
Today, we introduce SmolVLA, a compact (450 million parameters), open-source vision-language-action model for robotics that can run on consumer-grade hardware.
It is pre-trained exclusively on open-source, community-shared datasets with compatible licenses, tagged `lerobot` on the Hugging Face Hub.
SmolVLA-450M outperforms many larger VLA models and strong baselines such as ACT in simulated environments (LIBERO, Meta-World) and on real-world tasks (SO-100, SO-101).
It also supports asynchronous inference, enabling roughly 30% faster responses and about twice the task throughput.
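The idea behind asynchronous inference is to overlap action prediction with action execution: while the robot executes the current chunk of actions, the policy is already predicting the next chunk. The sketch below illustrates this pattern with a producer thread and a queue; the model, latencies, and action names are all stand-ins, not the actual SmolVLA implementation.

```python
import queue
import threading
import time


def predict_chunk(observation):
    # Stand-in for the policy forward pass: a real VLA predicts a chunk
    # of future actions from camera images and a language instruction.
    time.sleep(0.02)  # simulated inference latency
    return [f"action_{observation}_{i}" for i in range(4)]


def run_async(num_chunks=5):
    """Overlap inference with execution: while the robot executes the
    current action chunk, the next chunk is already being predicted."""
    chunks = queue.Queue(maxsize=1)
    executed = []

    def producer():
        for obs in range(num_chunks):
            chunks.put(predict_chunk(obs))
        chunks.put(None)  # sentinel: no more chunks

    worker = threading.Thread(target=producer)
    worker.start()
    while (chunk := chunks.get()) is not None:
        for action in chunk:
            time.sleep(0.005)  # simulated actuation time per action
            executed.append(action)
    worker.join()
    return executed
```

Because prediction and execution run concurrently, the per-chunk inference latency is largely hidden behind actuation time, which is where the throughput gain comes from.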
Useful links:
- Hardware used for training and evaluation (SO-100/SO-101): https://github.com/TheRobotStudio/SO-ARM100
- Base model: https://huggingface.co/lerobot/smolvla_base
- Paper: https://huggingface.co/papers/2506.01844
In recent years, Transformers have driven significant progress in the AI field, from language models capable of human-like reasoning to multimodal systems that understand images and text.
However, progress in real-world robotics has been much slower. Robots still struggle to generalize across diverse objects, environments, and tasks. This limited progress stems from a lack of high-quality, diverse data and from the absence of models that can reason and act in the physical world the way humans do.
To address these challenges, the field has recently turned to Vision-Language-Action (VLA) models, which aim to unify perception, language understanding, and action prediction within a single architecture.
VLAs typically take raw visual observations and natural language instructions as input and output corresponding robot actions. Despite their promise, many recent advances in the VLA field remain locked behind proprietary models trained on large-scale private datasets, often requiring expensive hardware setups and substantial engineering resources.
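The input/output contract described above can be sketched as a minimal interface. All names and shapes here are illustrative placeholders, not the actual SmolVLA API: a real policy would replace the stub body with a vision-language backbone and an action-prediction head.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    # Raw camera frame as nested lists of RGB pixels (stand-in for a tensor).
    image: List[List[List[int]]]
    instruction: str  # natural-language task description


@dataclass
class Action:
    # Target joint positions for the arm (illustrative 6-DoF layout).
    joint_targets: List[float]
    gripper: float  # 0.0 = open, 1.0 = closed


def predict_actions(obs: Observation, horizon: int = 4) -> List[Action]:
    """Hypothetical VLA policy stub: maps (image, instruction) to a
    short chunk of future actions."""
    assert obs.instruction, "a language instruction is required"
    return [Action(joint_targets=[0.0] * 6, gripper=0.0) for _ in range(horizon)]
```

Predicting a short chunk of actions per inference call, rather than a single action, is a common VLA design choice that reduces how often the (comparatively slow) model must run.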