MolmoMotion: Language-guided 3D motion forecasting
A blog post by Ai2 on Hugging Face. Published June 17, 2026. Kyle Wiggers (Ai2Comms).
🧠 Models: https://huggingface.co/collections/allenai/molmomotion | 📄 Tech Report: https://allenai.org/papers/molmomotion | 📊 Data: https://huggingface.co/datasets/allenai/molmo-motion-1m
Key Takeaways
- Perception is retrospective: it explains motion that has already happened. Many of the systems and applications we want to build need to look forward instead.
A robot reaching for a cup has to anticipate how the cup will move before it touches it.
- Given an RGB observation, a set of query points on an object, and an action description, MolmoMotion predicts the object's future 3D point trajectory.
These predicted trajectories can then drive downstream applications such as robotics planning and trajectory-conditioned video generation.
- The 3D point trajectory representation was chosen because the authors needed a general motion representation with three properties, including:
  Class-agnostic: not tied to templates for human bodies, hands, rigid objects, or any other fixed category.
  View-stable: the same physical motion should be represented consistently across cameras and viewpoints.
- To forecast those trajectories, MolmoMotion uses Molmo 2 as its backbone, allowing it to connect language instructions to objects and points in an image.
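
To make the input and output described above concrete, here is a minimal sketch of how the three inputs (RGB observation, query points, action description) and the predicted 3D point trajectory might be laid out as data. Every name here (`MotionQuery`, `PointTrajectory`, `displacement`) is hypothetical and chosen for illustration only; this is not MolmoMotion's actual API, and the trajectory values are hand-written stand-ins, not model output.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point2D = Tuple[float, float]          # pixel coordinates (u, v) in the RGB observation
Point3D = Tuple[float, float, float]   # a 3D position (x, y, z)


@dataclass
class MotionQuery:
    """Hypothetical container for the three inputs named in the text."""
    image_size: Tuple[int, int]   # (height, width) of the RGB observation; pixel data omitted
    query_points: List[Point2D]   # points placed on the object of interest
    action: str                   # natural-language action description


@dataclass
class PointTrajectory:
    """Future 3D positions: steps[t][i] is query point i at future step t."""
    steps: List[List[Point3D]]

    def validate(self, num_points: int) -> None:
        # Every future step must carry exactly one 3D position per query point.
        for t, frame in enumerate(self.steps):
            if len(frame) != num_points:
                raise ValueError(
                    f"step {t} has {len(frame)} points, expected {num_points}"
                )


def displacement(traj: PointTrajectory, i: int) -> Point3D:
    """Net 3D displacement of query point i between the first and last step."""
    x0, y0, z0 = traj.steps[0][i]
    x1, y1, z1 = traj.steps[-1][i]
    return (x1 - x0, y1 - y0, z1 - z0)


# Hand-written example values (assumptions, not model predictions).
query = MotionQuery(
    image_size=(480, 640),
    query_points=[(320.0, 240.0), (330.0, 250.0)],
    action="push the cup to the right",
)
trajectory = PointTrajectory(
    steps=[
        [(0.0, 0.0, 1.0), (0.25, 0.0, 1.0)],
        [(0.5, 0.0, 1.0), (0.75, 0.0, 1.0)],
    ]
)
trajectory.validate(len(query.query_points))
print(displacement(trajectory, 0))
```

Nothing in this layout refers to an object category, which mirrors the class-agnostic property: the same structure would hold for a cup, a hand, or a door.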
Stats & Key Facts
- 💻 Code: https://github. git | 🌐 Project Page: https://molmomotion. io/
- Machines have become remarkably good at perceiving motion.
Given a video, modern models can track how objects and points move through a scene with exceptionally high confidence. But perception is inherently retrospective: it explains motion that has already happened. Many of the systems and applications we want to build need to look forward instead.
A robot reaching for a cup has to anticipate how the cup will move before it touches it. A video generator has to know what realistic motion comes next if it's going to produce physically plausible frames. Predicting motion is harder than observing it, but it's also far more useful in many scenarios.
For more details please read the original article at Hugging Face.