Foundation models have unlocked powerful zero-shot capabilities in vision and language, yet their reliance on internet-sourced pretraining data leaves them brittle in unstructured, real-world environments. The messy, real-world data encountered during deployment – such as low-resolution images, occluded signs, or multilingual text – remains massively underrepresented in existing corpora. Robots, as embodied agents, are uniquely positioned to close this gap: they can act in physical environments to collect large-scale, real-world data that enriches foundation model training with precisely the examples current models lack. We introduce the Robot-Powered Data Flywheel, a framework that transforms robots from consumers of foundation models into data generators. By deploying robots equipped with foundation models in the wild, we enable a virtuous cycle: robots perform useful tasks while simultaneously collecting domain-representative data that improves both domain-specific adaptation and domain-adjacent generalization. We instantiate this framework with Scanford, a mobile manipulator deployed in the East Asia Library for two weeks. Scanford autonomously scans shelves, identifies books using a vision-language model (VLM), and leverages the library catalog to automatically label images without human annotation. This deployment both aids librarians and produces a curated dataset to finetune the underlying VLM, improving performance in the domain-specific in-the-wild library setting and on domain-adjacent multilingual OCR benchmarks. Using data collected from 2103 shelves, Scanford improves VLM performance on multilingual book identification from 32.0% to 71.8% and boosts domain-adjacent multilingual OCR from 24.8% to 46.6% (English) and from 30.8% to 38.0% (Chinese), while saving an estimated 18.7 hours of human labor. These results highlight how robot-powered data flywheels can both reduce human effort in real deployments and unlock new pathways for continually adapting foundation models to the messy realities of the world.
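To make the flywheel concrete, here is a minimal Python sketch of the catalog-based auto-labeling loop described above. It is an illustration rather than the actual Scanford implementation: `capture_image`, `identify_books`, and `lookup_catalog` are hypothetical callables supplied by the deployment, and the fuzzy-matching threshold is an assumption.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class LabeledExample:
    image_path: str
    label: str  # "title | call number; title | call number; ..."


def fuzzy_match(pred, candidates, threshold=0.8):
    """Return the catalog entry most similar to a VLM prediction, if similar enough."""
    scored = [(SequenceMatcher(None, pred.lower(), c.lower()).ratio(), c) for c in candidates]
    if not scored:
        return None
    score, best = max(scored)
    return best if score >= threshold else None


def flywheel_pass(shelf_ids, capture_image, identify_books, lookup_catalog, dataset):
    """One collection pass: scan shelves, label images against the catalog, grow the dataset."""
    for shelf_id in shelf_ids:
        image_path = capture_image(shelf_id)       # robot scans the shelf
        predictions = identify_books(image_path)   # VLM proposes (title, call number) pairs
        catalog = [f"{b['title']} | {b['call_number']}" for b in lookup_catalog(shelf_id)]
        matched = [m for title, call_number in predictions
                   if (m := fuzzy_match(f"{title} | {call_number}", catalog)) is not None]
        if matched:
            # Matched catalog entries become training labels with no human annotation.
            dataset.append(LabeledExample(image_path, "; ".join(matched)))
    return dataset  # later used to finetune the VLM
```

The key design point is that the library catalog, not a human, provides the labels: in this sketch, a noisy VLM prediction only becomes a training example when it can be reconciled with a known catalog record for that shelf.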
The East Asia Library is a large library with varied shelf heights, lengths, backgrounds, and lighting conditions. We consider the task of scanning bookshelves for inventory management, which is time-consuming for librarians.
We visualize a representative shelf from each deployment session to highlight the diversity of the library.
Reading and identifying books in the wild is challenging due to damaged book labels, occlusions, and fading and aging of books -- these cases are often underrepresented in the curated internet data used to pre-train foundation models. The East Asia Library poses another challenge: all of its books are in Chinese, Japanese, or Korean, and existing foundation models struggle with these languages due to the heavy bias towards English in internet data. Scanford addresses these challenges by fine-tuning the VLM on real-world data collected from the East Asia Library using a mobile manipulator.
We evaluate Scanford, a Robot-Powered Data Flywheel system, on how well in-the-wild deployment improves foundation model adaptation and generalization, as well as on its deployment efficacy.
A key insight of our framework is that by deploying robots in the wild, we can collect domain-representative data that fills the gaps in the curated internet data used to pre-train foundation models. This data, in turn, can be used to improve the performance of foundation models on both domain-specific and domain-adjacent tasks.
Our experiments demonstrate significant improvements (+39.0%) on the domain-specific task of book identification through continual learning and adaptation. The data flywheel approach enables the system to autonomously collect domain-representative data, leading Scanford to outperform both pre-trained VLM baselines.
We evaluate the performance of the VLM on two domain-adjacent tasks: English and Chinese "difficult" OCR, as classified by prior work (see examples below). These cases can contain severe occlusion, distortion, calligraphic fonts, and blur, and are thus often cleaned out of internet data. However, these challenges cannot be ignored, as they are prevalent in in-the-wild settings -- the Robot-Powered Data Flywheel (RPDF) framework proposes using robot deployments to address these final-mile perception challenges.
We find that fine-tuning the VLM on real-world data collected by Scanford leads to significant improvements for both domain-adjacent tasks (English OCR: +21.8%, Chinese OCR: +7.2%).
This highlights that the data collected by Scanford not only improves domain-specific performance, but also improves the foundation model's generalization on domain-adjacent tasks. More broadly, this supports the insight that real-world data collected by our framework can fill the gaps in the curated internet data used to pre-train foundation models.
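The OCR metric behind these percentages is not spelled out here; as an illustration only, the snippet below computes a simple exact-match accuracy over predicted and reference transcriptions, one common way such numbers are reported (the paper's actual metric may differ).

```python
def exact_match_accuracy(predictions, references):
    """Fraction of OCR predictions that exactly match the reference transcription
    after stripping surrounding whitespace. Illustrative only."""
    assert len(predictions) == len(references)
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Toy example over three difficult crops (one misread):
print(exact_match_accuracy(["STANFORD", "图书馆", "lbrary"],
                           ["STANFORD", "图书馆", "library"]))  # 0.666...
```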
Over two weeks, Scanford scanned a total of 2103 bookshelves, saving a librarian-estimated 18.7 hours of manual scanning. Scanford also averaged 2.6 human interventions per day over the course of its full deployment.
We use the following prompts to evaluate the VLM: two book-identification prompts for the domain-specific task (one for a full shelf image and one for a subdivided shelf image), and two transcription prompts for the domain-adjacent OCR tasks. A usage sketch follows the prompts.
# Book Shelf Image Analysis Task
## Current Image
Image: [image]
## Task
Please analyze the image and provide a label describing the books visible in this image.
The label should include:
- The title of the book visible on the spine
- The call number of the book visible on the spine (usually at the bottom)
Each label should include all the books facing the camera and each book should be separated by a semicolon.
To be clear, the formatting should be: title | call number; title | call number; title | call number; ...
Format your response as a clear, concise label that could be used for training machine learning models. Make sure you only output the books that are visible in the image.
## Label:
## Task
Please analyze the image and provide a label describing the books visible in this subdivided image.
The label should include:
- The title of the book visible on the spine
- The call number of the book visible on the spine (usually at the bottom)
Each label should include all the books facing the camera and each book should be separated by a semicolon.
To be clear, the formatting should be: title | call number; title | call number; title | call number; ...
Format your response as a clear, concise label that could be used for training machine learning models. Make sure you only output the books that are visible in the image.
[image]
## Label:
## Task
Please analyze the image and provide a label with all the text visible exactly as it appears in the image. Only return plain text characters (e.g. no @ symbols).
## Label:
## Task
Please analyze the image and provide a label with all the text visible exactly as it appears in the image.
## Label:
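For context on how these prompts are consumed, the sketch below sends a shelf image plus the book-identification prompt to a VLM through an OpenAI-compatible chat endpoint (as exposed by many local serving stacks). The endpoint, model name, and client shown here are illustrative assumptions, not the setup used in this work.

```python
import base64

from openai import OpenAI

# Illustrative: any OpenAI-compatible endpoint (e.g. a locally served open VLM) works the same way.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

PROMPT = (
    "# Book Shelf Image Analysis Task\n"
    "## Task\n"
    "Please analyze the image and provide a label describing the books visible in this image.\n"
    "...\n"  # remainder of the shelf prompt shown above
    "## Label:"
)


def label_shelf(image_path, model="finetuned-vlm"):
    """Send one shelf image plus the book-identification prompt to the VLM."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content  # e.g. "title | call number; title | call number; ..."
```

The returned string follows the `title | call number; ...` format requested by the prompt, which makes it straightforward to parse and compare against catalog entries.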
@article{grannen2025scanford,
  title   = {Robot-Powered Data Flywheels: Deploying Robots in the Wild for Continual Data Collection and Foundation Model Adaptation},
  author  = {Jennifer Grannen and Michelle Pan and Kenneth Llontop and Cherie Ho and Mark Zolotas and Jeannette Bohg and Dorsa Sadigh},
  year    = {2025},
  journal = {arXiv preprint arXiv:2511.19647}
}