Robot-Powered Data Flywheels: Deploying Robots in the Wild for Continual Data Collection and Foundation Model Adaptation

Stanford University, Toyota Research Institute

TLDR: During in-the-wild deployments, robots in the Robot-Powered Data Flywheel framework perform useful tasks while simultaneously collecting domain-representative data that improves both domain-specific adaptation and domain-adjacent generalization.

Abstract

Foundation models have unlocked powerful zero-shot capabilities in vision and language, yet their reliance on internet-sourced pretraining data leaves them brittle in unstructured, real-world environments. The messy, real-world data encountered during deployment – such as low-resolution images, occluded signs, or multilingual text – remains massively underrepresented in existing corpora. Robots, as embodied agents, are uniquely positioned to close this gap: they can act in physical environments to collect large-scale, real-world data that enriches foundation model training with precisely the examples current models lack. We introduce the Robot-Powered Data Flywheel, a framework that transforms robots from consumers of foundation models into data generators. By deploying robots equipped with foundation models in the wild, we enable a virtuous cycle: robots perform useful tasks while simultaneously collecting domain-representative data that improves both domain-specific adaptation and domain-adjacent generalization. We instantiate this framework with Scanford, a mobile manipulator robot deployed in the East Asia Library for two weeks. Scanford autonomously scans shelves, identifies books using a vision-language model (VLM), and leverages the library catalog to automatically label images without human annotation. This deployment both aids librarians and produces a curated dataset to finetune the underlying VLM, improving performance on the domain-specific in-the-wild library setting and on domain-adjacent multilingual OCR benchmarks. Using data collected from 2103 shelves, Scanford improves VLM performance on multilingual book identification from 32.0% to 71.8% and boosts domain-adjacent multilingual OCR from 24.8% to 46.6% (English) and 30.8% to 38.0% (Chinese), while saving an estimated 18.7 hours of human labor. These results highlight how robot-powered data flywheels can both reduce human effort in real deployments and unlock new pathways for continually adapting foundation models to the messy realities of the world.


Robot-Powered Data Flywheel Framework

Our Robot-Powered Data Flywheel (RPDF) framework bridges the gap between the curated internet data used to pre-train foundation models and messy, real-world deployment settings by deploying robots in the wild to collect domain-representative data.

Scanford Instantiation

We deploy Scanford in the East Asia Library for two weeks to scan books and assist with inventory management [Left]. Scanford uses a mobile manipulator to collect pictures of bookshelves and leverages a VLM to identify the books in each image by title and call number. These labels are then compared against a library catalog database to curate a clean, accurate dataset for VLM fine-tuning [Right, Top]. Crucially, the autonomously gathered data improves not only domain-specific performance on book identification, but also the domain-adjacent generalizability of foundation models (multilingual OCR) [Right, Bottom]. Scanford simultaneously (1) saves 18.7 hours of manual scanning, (2) collects real-world book data, and (3) improves the very foundation model it relies on – enhancing its own performance on the library task while also strengthening the model’s broader multilingual OCR capabilities.
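
To make the auto-labeling step concrete, below is a minimal Python sketch (our illustration, not the authors' released code) of how the VLM's "title | call number; ..." output might be parsed and checked against the library catalog. The function names, the catalog format (call number mapped to canonical title), and the all-books-verified acceptance rule are assumptions.

def parse_vlm_label(raw: str) -> list[tuple[str, str]]:
    """Parse 'title | call number; title | call number; ...' into (title, call number) pairs."""
    books = []
    for entry in raw.split(";"):
        parts = entry.split("|")
        if len(parts) == 2:
            title, call_number = parts[0].strip(), parts[1].strip()
            if title and call_number:
                books.append((title, call_number))
    return books

def curate_example(image_path: str, raw_vlm_output: str, catalog: dict[str, str]):
    """Keep a shelf image only if every predicted book is verified by the catalog.

    `catalog` is assumed to map call numbers to canonical titles.
    """
    books = parse_vlm_label(raw_vlm_output)
    verified = [(t, c) for t, c in books if catalog.get(c) == t]
    if books and len(verified) == len(books):  # assumed acceptance rule: all predictions verified
        label = "; ".join(f"{t} | {c}" for t, c in verified)
        return {"image": image_path, "label": label}
    return None  # discard shelves with unverified predictions

Shelves that pass this check could then feed directly into the fine-tuning set; rejected shelves could be re-scanned or dropped.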


In-the-Wild Deployment

East Asia Library Setting

The East Asia Library is a large library with varied shelf heights, lengths, backgrounds, and lighting conditions. We consider the task of scanning bookshelves for inventory management, which is time-consuming for librarians.

Navigation challenges visualization

We visualize a representative shelf from each deployment session to highlight the diversity of the library.

Challenges with Reading Books


Reading and identifying books in the wild is challenging due to damaged book labels, occlusions, and the fading and aging of books -- these cases are often underrepresented in the curated internet data used to pre-train foundation models. The East Asia Library poses another challenge: all of its books are in Chinese, Japanese, or Korean. Existing foundation models struggle with these languages due to the heavy bias toward English in internet data. Scanford addresses these challenges by fine-tuning the VLM on real-world data collected from the East Asia Library using a mobile manipulator.


Experiments

We evaluate Scanford, a Robot-Powered Data Flywheel system, on how well in-the-wild deployment improves foundation model adaptation and generalization, as well as on its deployment efficacy.

Adaptation Experiments

A key insight of our framework is that by deploying robots in the wild, we can collect domain-representative data that fills gaps in the curated internet data used to pre-train foundation models. This data, in turn, can be used to improve the performance of foundation models on both domain-specific and domain-adjacent tasks.
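
As one hedged sketch of how the curated shelf labels could be packaged for supervised fine-tuning (the record schema below is an assumption; the actual format depends on the VLM's fine-tuning pipeline), each verified shelf image becomes a prompt/response pair using the domain-specific prompt listed under VLM Prompts:

import json

PROMPT = (
    "## Task\n"
    "Please analyze the image and provide a label describing the books visible in this subdivided image.\n"
    "...\n"  # remainder of the domain-specific prompt shown in the VLM Prompts section
    "## Label:"
)

def to_finetune_record(image_path: str, label: str) -> dict:
    """One supervised example: (image + prompt) -> 'title | call number; ...' label."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": PROMPT},
            ]},
            {"role": "assistant", "content": label},
        ]
    }

def write_jsonl(examples: list[dict], out_path: str) -> None:
    """Write one JSON record per line, preserving non-ASCII (CJK) titles."""
    with open(out_path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(to_finetune_record(ex["image"], ex["label"]), ensure_ascii=False) + "\n")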

Domain-Specific Improvement


Our experiments demonstrate a significant improvement (+39.8%) on the domain-specific task of book identification through continual learning and adaptation. The data flywheel approach enables the system to autonomously collect domain-representative data, leading Scanford to outperform both pre-trained VLM baselines.
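
The exact book-identification metric is not spelled out on this page; one plausible shelf-level score, sketched below, is the fraction of ground-truth (title, call number) pairs from the catalog that also appear in the VLM's prediction:

def _norm(title: str, call_number: str) -> tuple[str, str]:
    """Case- and whitespace-insensitive key for matching a book."""
    return title.strip().lower(), call_number.strip().lower()

def shelf_accuracy(pred_books: list[tuple[str, str]],
                   true_books: list[tuple[str, str]]) -> float:
    """Fraction of ground-truth books recovered by the prediction (1.0 for an empty shelf)."""
    if not true_books:
        return 1.0
    pred_set = {_norm(t, c) for t, c in pred_books}
    return sum(_norm(t, c) in pred_set for t, c in true_books) / len(true_books)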

Domain-Adjacent Adaptation

We evaluate the performance of the VLM on two domain-adjacent tasks: English and Chinese "difficult" OCR, as classified by prior work (see examples below). These cases can contain severe occlusion, distortions, calligraphic fonts, and blur, and thus are often cleaned out of internet data. However, these challenges cannot be ignored, as they are prevalent in in-the-wild settings -- the RPDF framework proposes using robot deployments to address these final-mile perception challenges.

Difficult OCR Examples (English)

Difficult OCR Examples (Chinese)

We find that fine-tuning the VLM on real-world data collected by Scanford leads to significant improvements for both domain-adjacent tasks (English OCR: +21.8%, Chinese OCR: +7.2%).

Domain-adjacent results

This highlights that the data collected by Scanford not only improves domain-specific performance, but also improves the foundation model's generalization capabilities on domain-adjacent tasks. More broadly, this supports the insight that real-world data collected by our framework can fill the domain gaps of the curated internet data used to pre-train foundation models.
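
For concreteness, here is a hedged sketch of how such OCR accuracies might be scored; the benchmarks' official metrics may differ (for example, they may use character error rate rather than exact match):

import difflib

def _normalize(text: str) -> str:
    """Drop all whitespace so spacing differences are not penalized."""
    return "".join(text.split())

def exact_match(pred: str, target: str) -> bool:
    return _normalize(pred) == _normalize(target)

def char_similarity(pred: str, target: str) -> float:
    """Rough character-level similarity in [0, 1] via difflib's SequenceMatcher."""
    return difflib.SequenceMatcher(None, _normalize(pred), _normalize(target)).ratio()

def ocr_accuracy(preds: list[str], targets: list[str]) -> float:
    """Fraction of examples with an exact (whitespace-insensitive) match."""
    assert len(preds) == len(targets)
    return sum(exact_match(p, t) for p, t in zip(preds, targets)) / max(len(targets), 1)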

Deployment Efficacy Results

Over two weeks, Scanford scanned a total of 2103 bookshelves, saving a librarian-estimated 18.7 hours of manual scanning. Scanford also averaged 2.6 human interventions per day over the course of its full deployment.
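
A quick back-of-the-envelope check of these figures (the 14-day span is an assumption from "two weeks"; the number of active scanning days may be smaller):

shelves = 2103
hours_saved = 18.7           # librarian-estimated manual scanning time avoided
interventions_per_day = 2.6
deployment_days = 14         # assumed from "two weeks"

seconds_saved_per_shelf = hours_saved * 3600 / shelves          # roughly 32 s per shelf
total_interventions = interventions_per_day * deployment_days   # roughly 36 interventions

print(f"~{seconds_saved_per_shelf:.0f} s of manual scanning saved per shelf; "
      f"~{total_interventions:.0f} interventions over the deployment")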


VLM Prompts

We use the following prompts to evaluate the VLM's performance on the domain-specific and domain-adjacent tasks.

Domain-Specific Evaluation Prompts

Gemini Domain-Specific Prompt

# Book Shelf Image Analysis Task
## Current Image
Image: [image]
  
## Task
Please analyze the image and provide a label describing the books visible in this image. 
  
The label should include:
- The title of the book visible on the spine
- The call number of the book visible on the spine (usually at the bottom)
Each label should include all the books facing the camera and each book should be separated by a semicolon. 
To be clear, the formatting should be: title | call number; title | call number; title | call number; ...
  
Format your response as a clear, concise label that could be used for training machine learning models. Make sure you only output the books that are visible in the image.
  
## Label:

Qwen Domain-Specific Prompt

## Task
Please analyze the image and provide a label describing the books visible in this subdivided image. 
  
The label should include:
- The title of the book visible on the spine
- The call number of the book visible on the spine (usually at the bottom)
Each label should include all the books facing the camera and each book should be separated by a semicolon. 
To be clear, the formatting should be: title | call number; title | call number; title | call number; ...
  
Format your response as a clear, concise label that could be used for training machine learning models. Make sure you only output the books that are visible in the image.
[image]
  
## Label:

Domain-Adjacent Evaluation Prompts

Gemini Domain-Adjacent Prompt

## Task
Please analyze the image and provide a label with all the text visible exactly as it appears in the image. Only return plain text characters (e.g. no @ symbols).
  
## Label:

Qwen Domain-Adjacent Prompt

## Task 
Please analyze the image and provide a label with all the text visible exactly as it appears in the image. 

## Label:

Citation

@article{grannen2025scanford,
  title   = {Robot-Powered Data Flywheels: Deploying Robots in the Wild for Continual Data Collection and Foundation Model Adaptation},
  author  = {Jennifer Grannen and Michelle Pan and Kenneth Llontop and Cherie Ho and Mark Zolotas and Jeannette Bohg and Dorsa Sadigh},
  year    = 2025,
  journal = {arXiv preprint arXiv:2511.19647}
}