Visualize Dataset (v2.0+ latest dataset format)
Explore and visualize LeRobot datasets easily
In this post, we:
Recent advances in Vision-Language-Action (VLA) models have enabled robots to perform a wide range of tasks—from simple commands like “grasp the cube” to more complex activities like folding laundry or cleaning a table. These models aim to achieve generalization: the ability to perform tasks in novel settings, with unseen objects, and in varying conditions.
“The biggest challenge in robotics isn’t dexterity, but generalization—across physical, visual, and semantic levels.”
— Physical Intelligence
A robot must "figure out how to correctly perform even a simple task in a new setting or with new objects," and this requires both robust skills and common-sense understanding of the world. Yet, progress is often limited by the availability of diverse data for such robotic systems.
“Generalization must occur at many levels. At the low level, the robot must understand how to pick up a spoon (by the handle) or plate (by the edge), even if it has not seen these specific spoons or plates before, and even if they are placed in a pile of dirty dishes. At a higher level, the robot must understand the semantics of each task—where to put clothes and shoes (ideally in the laundry hamper or closet, not on the bed), and what kind of tool is appropriate for wiping down a spill. This generalization requires both robust physical skills and a common-sense understanding of the environment, so that the robot can generalize at many levels at the same time, from physical, to visual, to semantic. This is made even harder by the limited availability of diverse data for such robotic systems.”
— Physical Intelligence
Put simply, generalist policies rest on one idea: co-training on heterogeneous datasets. By exposing VLA models to a variety of environments, tasks, and robot embodiments, we can teach models not only how to act but also why: how to interpret a scene, understand a goal, and adapt skills across contexts.
💡 “Generalization is not just a model property—it’s a data phenomenon.”
It emerges from the diversity, quality, and abstraction level of the training data.
This brings us to a fundamental question:
Given current datasets, what is the upper limit of generalization we can expect?
Can a robot meaningfully respond to a completely novel prompt—say, *"set up a surprise birthday party"*—if it has never encountered anything remotely similar during training? Especially when most datasets are collected in academic labs, by a limited number of people, under well-controlled setups?
We take a data-centric view of generalization: we treat it as the process of abstracting broader patterns from data, essentially “zooming out” to reveal task-agnostic structures and principles. This shift in perspective emphasizes the role of dataset diversity, rather than model architecture alone, in driving generalization.
So far, the majority of robotics datasets come from structured academic environments. Even if we scale up to millions of demonstrations, one dataset will often dominate, limiting diversity. Unlike ImageNet—which aggregated internet-scale data and captured the real world more holistically—robotics lacks a comparably diverse, community-driven benchmark.
This is largely because collecting data for robotics requires physical hardware and significant effort.
That’s why, at LeRobot, we’re working to make robotics data collection more accessible—at home, at school, or anywhere. We're:
We're already seeing the results: the number of community-contributed datasets on the Hub is growing rapidly.
Growth of lerobot datasets on the Hugging Face Hub over time.
If we break down the uploaded datasets by robot type, most contributions come from the So100 and Koch arms, which makes robotic arms and manipulation tasks the primary focus of the current LeRobot dataset landscape. The potential, however, reaches far beyond manipulation: domains like autonomous vehicles, assistive robots, and mobile navigation stand to benefit just as much from shared data. This momentum brings us closer to a future where datasets reflect a global effort, not just the contributions of a single lab or institution.
Distribution of lerobot datasets by robot type.
Here are just a few standout community-contributed datasets that show how diverse and imaginative robotics can be:
lirislab/close_top_drawer_teabox: precise manipulation with a household drawer
Chojins/chess_game_001_blue_stereo: a full chess match captured from a stereo camera setup
pierfabre/chicken: yes, a robot interacting with colorful animal figures, including a chicken 🐔
Explore additional creative datasets under the LeRobot tag on the Hugging Face Hub, and interactively view them in the LeRobot Dataset Visualizer.
As robotics data collection becomes more democratized, curation becomes the next challenge. While these datasets are still collected in constrained setups, they are a crucial step toward affordable, general-purpose robotic policies. Not everyone has access to expensive hardware—but with shared infrastructure and open collaboration, we can build something far greater.
🧠 “Generalization isn’t solved in a lab—it’s taught by the world.”
The more diverse our data, the more capable our models will be.
Why does data quality matter? Poor-quality data results in poor downstream performance, biased outputs, and models that fail to generalize. Hence, efficient and high-quality data collection plays a critical role in advancing generalist robotic policies.
While foundation models in vision and language have thrived on massive, web-scale datasets, robotics lacks an “Internet of robots”—a vast, diverse corpus of real-world interactions. Instead, robotic data is fragmented across different embodiments, sensor setups, and control modes, forming isolated data islands.
To overcome this, recent approaches like Gr00t organize training data as a pyramid, where:
Within this framework, efficient real-world data collection is indispensable—it anchors learned behaviors in actual robotic hardware and closes the sim-to-real gap, ultimately improving the generalization, adaptability, and performance of robotics foundation models.
By expanding the volume and diversity of real-world datasets, we reduce fragmentation between heterogeneous data sources. When datasets are disjoint in terms of environment, embodiment, or task distribution, models struggle to transfer knowledge across domains.
🔗 Real-world data acts as connective tissue—it aligns abstract priors with grounded action and enables the model to build more coherent and transferable representations.
As a result, increasing the proportion of real robot interactions does not merely enhance realism—it structurally reinforces the links between all layers of the pyramid, leading to more robust and capable policies.
Data Pyramid for Robot Foundation Model Training. Adapted from Gr00t (Yang et al., 2025). Data quantity decreases while embodiment specificity increases from bottom to top.
At LeRobot, we’ve started developing an automatic curation pipeline to post-process community datasets. During the post-processing phase, we’ve identified several areas where improvements can further boost dataset quality and facilitate more effective curation going forward:
Many datasets have no task description, or one that lacks detail or leaves the intended task ambiguous. Semantics is at the core of cognition: understanding the context and specifics of a task is crucial for robotic performance. Detailed descriptions ensure that the robot understands exactly what is expected, and they also give the cognition system broader knowledge and vocabulary. Ambiguity can lead to incorrect interpretation and, consequently, incorrect actions.
Task instructions can be:
Subtask-level annotations are often missing, making it difficult to model complex task hierarchies.
While missing annotations can be filled in with a VLM, it is still better for the dataset's author to provide the task annotations directly.
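To make the point about missing or ambiguous task descriptions concrete, here is a minimal sketch of a lint pass over task strings. It is not part of LeRobot: the `lint_task` helper, the placeholder pattern, and the three-word minimum are all assumptions chosen for illustration, and a real curation pipeline would likely need richer checks.

```python
import re

# Hypothetical heuristic: labels such as "task1" or "demo2" carry no semantics.
PLACEHOLDER = re.compile(r"^(task|demo|episode|test)[\s_-]*\d*$", re.IGNORECASE)
MIN_WORDS = 3  # assumed threshold for a usable instruction


def lint_task(task):
    """Return a list of problems found in a single task description."""
    text = (task or "").strip()
    if not text:
        return ["missing"]
    problems = []
    if PLACEHOLDER.match(text):
        problems.append("placeholder")
    if len(text.split()) < MIN_WORDS:
        problems.append("too short")
    return problems


tasks = ["Pick the yellow lego block and put it in the box", "task1", "", "grasp"]
for task in tasks:
    print(repr(task), "->", lint_task(task) or "ok")
```

A check like this only flags the obvious cases. Whether a description is detailed enough for a given policy remains a judgment call for the dataset's author.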
Features like images.laptop are ambiguously labeled:
Manual mapping of dataset features to standardized names is time-consuming and error-prone.
Feature type inference could possibly be automated by using VLMs or computer vision models to classify camera perspectives. Even so, naming features clearly at recording time leads to a cleaner dataset.
Some datasets contain:
.parquet files without reindexing), breaking the sequential consistency.
Different datasets use different action or state dimensions, even for the same robot (e.g., so100).
Some datasets show inconsistencies in action/state format.
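The two problems above (gaps in the episode sequence and mismatched action or state dimensions) can be caught with simple checks before training. The sketch below works on plain Python dictionaries rather than on a real dataset object; the field names `episode_index` and `action` and both helper functions are assumptions made for illustration.

```python
def find_missing_episodes(episode_indices):
    """Return the indices missing from the expected 0..max sequence."""
    present = set(episode_indices)
    if not present:
        return []
    return [i for i in range(max(present) + 1) if i not in present]


def vector_lengths(frames, key):
    """Return the set of distinct lengths seen for a vector feature."""
    return {len(frame[key]) for frame in frames}


# Toy frames: episode 2 was deleted without reindexing, and one frame
# carries a 7-dimensional action instead of 6.
frames = [
    {"episode_index": 0, "action": [0.0] * 6},
    {"episode_index": 1, "action": [0.0] * 6},
    {"episode_index": 3, "action": [0.0] * 7},
]

missing = find_missing_episodes(f["episode_index"] for f in frames)
dims = vector_lengths(frames, "action")
print("missing episodes:", missing)
print("action dimensions:", sorted(dims))
```

More than one value in `dims` means the frames disagree on the action format, which is worth resolving before merging datasets.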
Since a high-quality dataset is essential for training reliable and generalizable robot policies, we have outlined a checklist of best practices to help you collect effective data.
Use a consistent and interpretable naming scheme for all camera views and observations:
Format:
<modality>.<location>
Examples:
images.top
images.front
images.left
images.right
Avoid device-specific names:
images.laptop
images.phone
For wrist-mounted cameras, specify orientation:
images.wrist.left
images.wrist.right
images.wrist.top
images.wrist.bottom
Consistent naming improves clarity and helps downstream models better interpret spatial configurations and multi-view inputs.
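The naming scheme above is simple enough to check automatically. The following is a minimal sketch of such a validator, assuming the vocabulary is exactly the set of locations and orientations listed in the examples. The `check_camera_key` function is hypothetical and not a LeRobot API.

```python
# Vocabulary taken from the examples above; extend it for your own setup.
LOCATIONS = {"top", "front", "left", "right"}
WRIST_ORIENTATIONS = {"left", "right", "top", "bottom"}
DEVICE_NAMES = {"laptop", "phone"}  # device-specific names to avoid


def check_camera_key(key):
    """Return None if `key` follows <modality>.<location>, else a short reason."""
    parts = key.split(".")
    if parts[0] != "images":
        return "modality is not 'images'"
    rest = parts[1:]
    if len(rest) == 1:
        if rest[0] in LOCATIONS:
            return None
        if rest[0] in DEVICE_NAMES:
            return "device-specific name"
        if rest[0] == "wrist":
            return "wrist camera needs an orientation"
    if len(rest) == 2 and rest[0] == "wrist" and rest[1] in WRIST_ORIENTATIONS:
        return None
    return "unknown location"


keys = ["images.top", "images.laptop", "images.wrist.left", "images.wrist", "images.side"]
for key in keys:
    print(key, "->", check_camera_key(key) or "ok")
```

Running a check like this at recording time is cheaper than remapping feature names by hand afterwards.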
Use the task field to clearly describe the robot’s objective.
Example: Pick the yellow lego block and put it in the box
Avoid vague labels such as task1, demo2, etc.
Below, we provide a checklist that serves as a guideline for recording datasets, outlining key points to keep in mind during the data collection process.
Figure 4: Dataset Recording Checklist – a step-by-step guide to ensure consistent and high-quality real-world data collection.
The next generation of generalist robots won't be built by a single person or lab — they'll be built by all of us. Whether you're a student, a researcher, or just robot-curious, here’s how you can jump in:
Start recording, start contributing—because the future of generalist robots depends on the data we build today.
Very interesting article. Advancing data for Physical AI will require a joint effort between academia, robotics enthusiasts, and the open-source community, alongside contributions from data collection companies, much like we saw with data labeling in AI (e.g., Scale AI, Labelbox). That’s exactly what we’re building at Deplace AI, data on-demand for Physical AI, and we’re also keen to contribute to the open-source ecosystem.
Lots of insights and wisdom. Very nice. Articulating these ideas and notions has much value and will pay off. Thank you.