SmolVLA: Open-Source Robotics AI
SmolVLA: Compact 450M open-source robotics AI model. Runs on MacBook, trained on community datasets. 30% faster inference, outperforms larger VLA models.

Meng Li
Jul 04, 2025
"RoboPub" Publication: 20% Discount Offer Link.


A compact open-source model running on a MacBook. Community-driven datasets are driving robotics in the real world. A new era of accessible, intelligent machines has arrived.

SmolVLA: Efficient Vision-Language-Action Model Trained on LeRobot Community Data

Today, we introduce SmolVLA, a compact (450 million parameters), open-source vision-language-action model for robotics that can run on consumer-grade hardware.

It is pre-trained exclusively on open-source, community-shared datasets with compatible licenses, tagged with LeRobot.

SmolVLA-450M outperforms many larger VLA models and strong baseline models like ACT in simulated environments (LIBERO, Meta-World) and real-world tasks (SO100, SO101).

It supports asynchronous inference, enabling 30% faster responses and twice the task throughput.
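Asynchronous inference works by decoupling action prediction from action execution: the policy predicts the next chunk of actions while the robot is still carrying out the current one, so the arm never sits idle waiting on the model. Below is a conceptual Python sketch of that decoupled loop using a queue and two threads. It is an illustration only, not the lerobot implementation; predict_chunk and execute are hypothetical placeholders for the policy call and the robot's control interface.

```python
import queue
import threading
import time

# Hypothetical placeholders (not the lerobot API): predict_chunk() stands in for
# the policy's action-chunk prediction, execute() for one robot control step.
def predict_chunk(observation):
    time.sleep(0.3)                                  # simulate model latency
    return [f"action_{i}" for i in range(10)]        # an "action chunk"

def execute(action):
    time.sleep(0.05)                                 # simulate one control step

action_queue = queue.Queue(maxsize=20)

def predictor(stop_event):
    # Producer: keep generating the next chunk while earlier actions still execute.
    while not stop_event.is_set():
        observation = "latest camera frames + instruction"   # placeholder observation
        for action in predict_chunk(observation):
            action_queue.put(action)

def controller(stop_event, steps=50):
    # Consumer: pull actions at the control frequency, independent of model latency.
    for _ in range(steps):
        execute(action_queue.get())
    stop_event.set()

stop = threading.Event()
threading.Thread(target=predictor, args=(stop,), daemon=True).start()
controller(stop)
```

In a synchronous loop the robot would pause for every new prediction; overlapping prediction with execution is where the reported latency and throughput gains come from.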

Useful links:

Hardware used for training and evaluating SO-100/101: https://github.com/TheRobotStudio/SO-ARM100

Base model: https://huggingface.co/lerobot/smolvla_base

Paper: https://huggingface.co/papers/2506.01844
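To try the base checkpoint locally, it can be loaded through the LeRobot library. The snippet below is a minimal sketch that assumes `pip install lerobot` and that the library exposes a SmolVLAPolicy class at the import path shown; the path has moved between releases, so check the version you have installed.

```python
# Minimal loading sketch: the import path below is an assumption and may differ
# depending on your installed lerobot version.
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Download and load the pretrained base checkpoint from the Hugging Face Hub.
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()

# Sanity check: the parameter count should be on the order of 450M.
n_params = sum(p.numel() for p in policy.parameters())
print(f"{n_params / 1e6:.0f}M parameters")
```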

In recent years, Transformers have driven significant progress in the AI field, from language models capable of human-like reasoning to multimodal systems that understand images and text.

However, progress in real-world robotics has been much slower. Robots still struggle to generalize across diverse objects, environments, and tasks. This slow progress stems from a shortage of high-quality, diverse data and from the absence of models that can reason and act in the physical world the way humans do.

To address these challenges, the field has recently turned to Vision-Language-Action (VLA) models, which aim to unify perception, language understanding, and action prediction within a single architecture.

VLAs typically take raw visual observations and natural language instructions as input and output corresponding robot actions. Despite their promise, many recent advances in the VLA field remain locked behind proprietary models trained on large-scale private datasets, often requiring expensive hardware setups and substantial engineering resources.
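To make that interface concrete, here is a schematic sketch of the generic VLA contract described above: raw visual observations plus a natural-language instruction in, a low-level robot action out. The class and field names are illustrative assumptions, not SmolVLA's actual code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    images: dict[str, np.ndarray]   # e.g. {"top_cam": HxWx3 frame, "wrist_cam": ...}
    state: np.ndarray               # proprioception: joint positions, gripper opening
    instruction: str                # e.g. "pick up the red cube and place it in the bin"

class VLAPolicy:
    """Illustrative interface only; SmolVLA's real classes live in the LeRobot repo."""

    def select_action(self, obs: Observation) -> np.ndarray:
        # Map (vision, language, proprioception) to a continuous action vector,
        # e.g. target joint positions or end-effector deltas plus a gripper command.
        raise NotImplementedError
```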
