REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly
Description
📖 Introduction
Robotic manipulation remains a core challenge in robotics, particularly for contact-rich tasks such as industrial assembly and disassembly. Existing datasets have significantly advanced learning in manipulation but are primarily focused on simpler tasks like object rearrangement, falling short of capturing the complexity and physical dynamics involved in assembly and disassembly. To bridge this gap, we present REASSEMBLE (Robotic assEmbly disASSEMBLy datasEt), a new dataset designed specifically for contact-rich manipulation tasks. Built around the NIST Assembly Task Board 1 benchmark, REASSEMBLE includes four actions (pick, insert, remove, and place) involving 17 objects. The dataset contains 4,551 demonstrations, of which 4,035 were successful, spanning a total of 781 minutes. Our dataset features multi-modal sensor data including event cameras, force-torque sensors, microphones, and multi-view RGB cameras. This diverse dataset supports research in areas such as learning contact-rich manipulation, task condition identification, action segmentation, and more. We believe REASSEMBLE will be a valuable resource for advancing robotic manipulation in complex, real-world scenarios.
✨ Key Features
- Multimodality: REASSEMBLE contains data from robot proprioception, RGB cameras, force-torque sensors, microphones, and event cameras.
- Multitask labels: REASSEMBLE contains labels that enable research in Temporal Action Segmentation, Motion Policy Learning, Anomaly Detection, and Task Inversion.
- Long horizon: Demonstrations in the REASSEMBLE dataset cover long-horizon tasks whose actions usually span multiple steps.
- Hierarchical labels: REASSEMBLE contains action segmentation labels at two hierarchical levels.
🔴 Dataset Collection
Each demonstration starts by randomizing the board and object poses, after which an operator teleoperates the robot to assemble and disassemble the board while narrating their actions and marking task segment boundaries with key presses. The narrated descriptions are transcribed using Whisper [1], and the board and camera poses are measured at the beginning using a motion capture system, though continuous tracking is avoided due to interference with the event camera. Sensory data is recorded with rosbag and later post-processed into HDF5 files without downsampling or synchronization, preserving raw data and timestamps for future flexibility. To reduce memory usage, video and audio are stored as encoded MP4 and MP3 files, respectively. Transcription errors are corrected automatically or manually, and a custom visualization tool is used to validate the synchronization and correctness of all data and annotations. Missing or incorrect entries are identified and corrected, ensuring the dataset’s completeness. Low-level Skill annotations were added manually after data collection, and all labels were carefully reviewed to ensure accuracy.
📁 Dataset Structure
The dataset consists of several HDF5 (.h5) and JSON (.json) files, organized into two directories. The poses directory contains the JSON files, which store the poses of the cameras and the board in the world coordinate frame. The data directory contains the HDF5 files, which store the sensory readings and annotations collected as part of the REASSEMBLE dataset. Each JSON file can be matched with its corresponding HDF5 file based on their filenames, which include the timestamp when the data was recorded. For example, 2025-01-09-13-59-54_poses.json corresponds to 2025-01-09-13-59-54.h5.
The structure of the JSON files is as follows:
{"Hama1": [
[x ,y, z],
[qx, qy, qz, qw]
],
"Hama2": [
[x ,y, z],
[qx, qy, qz, qw]
],
"DAVIS346": [
[x ,y, z],
[qx, qy, qz, qw]
],
"NIST_Board1": [
[x ,y, z],
[qx, qy, qz, qw]
]
}
[x, y, z] represents the position of the object, and [qx, qy, qz, qw] represents its orientation as a quaternion.
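To make the file pairing and the pose format concrete, here is a minimal Python sketch (not the official loader) that walks the poses directory, derives the matching HDF5 filename, and reads one pose. The extraction root "REASSEMBLE" is an assumption; the "poses"/"data" directory names and the JSON keys follow the description above.

```python
# Minimal sketch (not the official loader): pair each *_poses.json file with its
# HDF5 recording and read the board pose. The extraction root "REASSEMBLE" is an
# assumption; directory and key names follow the dataset description above.
import glob
import json
import os

root = "REASSEMBLE"
for pose_path in sorted(glob.glob(os.path.join(root, "poses", "*_poses.json"))):
    # e.g. 2025-01-09-13-59-54_poses.json -> 2025-01-09-13-59-54.h5
    stamp = os.path.basename(pose_path).replace("_poses.json", "")
    h5_path = os.path.join(root, "data", stamp + ".h5")

    with open(pose_path) as f:
        poses = json.load(f)

    # Each entry is [[x, y, z], [qx, qy, qz, qw]] in the world coordinate frame.
    position, quaternion = poses["NIST_Board1"]
    print(stamp, "->", h5_path, "| board at", position, "orientation", quaternion)
```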
The HDF5 (.h5) format organizes data into two main types of structures: datasets, which hold the actual data, and groups, which act like folders that can contain datasets or other groups. In the diagram below, groups are shown as folder icons, and datasets as file icons. The main group of the file directly contains the video, audio, and event data. To save memory, video and audio are stored as encoded byte strings, while event data is stored as arrays. The robot’s proprioceptive information is kept in the robot_state group as arrays. Because different sensors record data at different rates, the arrays vary in length (signified by the N_xxx variable in the data shapes). To align the sensory data, each sensor’s timestamps are stored separately in the timestamps group. Information about action segments is stored in the segments_info group. Each segment is saved as a subgroup, named according to its order in the demonstration, and includes a start timestamp, end timestamp, a success indicator, and a natural language description of the action. Within each segment, low-level skills are organized under a low_level subgroup, following the same structure as the high-level annotations.
📁 <date_time>.h5
├── 📄 hama1 - mp4 encoded video
├── 📄 hama1_audio - mp3 encoded audio
├── 📄 hama2 - mp4 encoded video
├── 📄 hama2_audio - mp3 encoded audio
├── 📄 hand - mp4 encoded video
├── 📄 hand_audio - mp3 encoded audio
├── 📄 capture_node - mp4 encoded video (Event camera)
├── 📄 events - N_events x 3 (x, y, polarity)
├── 📁 robot_state
│   ├── 📄 compensated_base_force - N_bf x 3 (x, y, z)
│   ├── 📄 compenseted_base_torque - N_bt x 3 (x, y, z)
│   ├── 📄 gripper_positions - N_grip x 2 (left, right)
│   ├── 📄 joint_efforts - N_je x 7 (one for each joint)
│   ├── 📄 joint_positions - N_jp x 7 (one for each joint)
│   ├── 📄 joint_velocities - N_jv x 7 (one for each joint)
│   ├── 📄 measured_force - N_mf x 3 (x, y, z)
│   ├── 📄 measured_torque - N_mt x 3 (x, y, z)
│   ├── 📄 pose - N_poses x 7 (x, y, z, qw, qx, qy, qz)
│   └── 📄 velocity - N_vels x 6 (x, y, z, ω, γ, θ)
├── 📁 timestamps
│   ├── 📄 hama1 - N_hama1 x 1
│   ├── 📄 hama2 - N_hama2 x 1
│   ├── 📄 hand - N_hand x 1
│   ├── 📄 capture_node - N_capture x 1
│   ├── 📄 events - N_events x 1
│   ├── 📄 compensated_base_force - N_bf x 1
│   ├── 📄 compenseted_base_torque - N_bt x 1
│   ├── 📄 gripper_positions - N_grip x 1
│   ├── 📄 joint_efforts - N_je x 1
│   ├── 📄 joint_positions - N_jp x 1
│   ├── 📄 joint_velocities - N_jv x 1
│   ├── 📄 measured_force - N_mf x 1
│   ├── 📄 measured_torque - N_mt x 1
│   ├── 📄 pose - N_poses x 1
│   └── 📄 velocity - N_vels x 1
└── 📁 segments_info
    ├── 📁 0
    │   ├── 📄 start - scalar
    │   ├── 📄 end - scalar
    │   ├── 📄 success - Boolean
    │   ├── 📄 text - scalar
    │   └── 📁 low_level
    │       ├── 📁 0
    │       │   ├── 📄 start - scalar
    │       │   ├── 📄 end - scalar
    │       │   ├── 📄 success - Boolean
    │       │   └── 📄 text - scalar
    │       ├── 📁 1
    │       ⋮
    ├── 📁 1
    ⋮
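As a starting point for working with this layout, the sketch below reads proprioception together with its timestamps using h5py (listed under Requires), aligns two unsynchronized streams, dumps one encoded video to disk, and walks the segment annotations. The filename, the byte-dump approach for the encoded media, and the nearest-sample alignment are illustrative assumptions; the loaders in the GitHub repository are the reference implementation.

```python
# Minimal sketch (not the official loader) for inspecting one REASSEMBLE recording.
# Dataset/group names follow the tree above; the filename, the byte-dump approach
# for the encoded video, and the alignment strategy are assumptions.
import h5py
import numpy as np

with h5py.File("data/2025-01-09-13-59-54.h5", "r") as f:
    # Proprioception: each stream has its own length and its own timestamps.
    pose = f["robot_state/pose"][()]                      # N_poses x 7 (x, y, z, qw, qx, qy, qz)
    pose_t = f["timestamps/pose"][()].ravel()             # N_poses x 1
    force = f["robot_state/measured_force"][()]           # N_mf x 3 (x, y, z)
    force_t = f["timestamps/measured_force"][()].ravel()  # N_mf x 1

    # Example alignment: for every pose timestamp, take the first force sample
    # recorded at or after it (the data itself is stored unsynchronized).
    idx = np.clip(np.searchsorted(force_t, pose_t), 0, len(force) - 1)
    force_at_pose = force[idx]

    # Video/audio are stored as encoded byte strings; writing the raw bytes to
    # disk yields a playable file.
    with open("hama1.mp4", "wb") as out:
        out.write(bytes(f["hama1"][()]))

    # High-level action segments, named "0", "1", ... in demonstration order.
    for name in sorted(f["segments_info"], key=int):
        seg = f["segments_info"][name]
        print(name, float(seg["start"][()]), float(seg["end"][()]),
              bool(seg["success"][()]), seg["text"][()])
```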
The splits folder contains two text files that list the .h5 files used for the training and validation splits.
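A minimal way to consume these lists is sketched below; the exact filenames inside the splits folder (train.txt and val.txt here) are assumptions.

```python
# Minimal sketch: read the split lists and resolve them to paths under data/.
# The filenames "train.txt" and "val.txt" are assumptions.
import os

def read_split(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

train_files = read_split(os.path.join("splits", "train.txt"))
val_files = read_split(os.path.join("splits", "val.txt"))
train_paths = [os.path.join("data", name) for name in train_files]
print(len(train_files), "training and", len(val_files), "validation recordings")
```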
🔗 Important Resources
The project website contains more details about the REASSEMBLE dataset. The code for loading and visualizing the data is available in our GitHub repository.
🌐 Project website: https://tuwien-asl.github.io/REASSEMBLE_page/
💻 Code: https://github.com/TUWIEN-ASL/REASSEMBLE
⚠️ File comments
| Recording | Issue |
|---|---|
| 2025-01-10-15-28-50.h5 | hand cam missing at beginning |
| 2025-01-10-16-17-40.h5 | missing hand cam |
| 2025-01-10-17-10-38.h5 | hand cam missing at beginning |
| 2025-01-10-17-54-09.h5 | no empty action at beginning |
| 2025-01-11-14-22-09.h5 | no empty action at beginning |
| 2025-01-11-14-45-48.h5 | F/T not valid for last action |
| 2025-01-11-15-27-19.h5 | F/T not valid for last action |
| 2025-01-11-15-35-08.h5 | F/T not valid for last action |
| 2025-01-13-11-16-17.h5 | gripper broke for last action |
| 2025-01-13-11-18-57.h5 | pose not available for last action |
Files
data.zip
Additional details
Additional titles
- Alternative title (English)
- Demonstrating REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly
Identifiers
- arXiv
- arXiv:2502.05086
Related works
- Cites
- Conference Paper: 10.15607/RSS.2024.XX.120 (DOI)
- Conference Paper: 10.1109/ICRA57147.2024.10611615 (DOI)
- Conference Paper: 10.1177/02783649241304789 (DOI)
- Journal Article: 10.1109/LRA.2024.3520916 (DOI)
- Requires
- Software: https://www.h5py.org/ (Other)
Funding
- INteractive robots that intuitiVely lEarn to inVErt tasks by ReaSoning about their Execution (INVERSE) 101136067
- European Union
- Robot Industry Core Technology Development Program 00416440
- Ministry of Trade, Industry and Energy (MOTIE)
Dates
- Accepted
  - 2025-04-25: Accepted at Robotics: Science and Systems
- Collected
  - 2025-01-09/2025-01-14: Data collection period
- Submitted
  - 2025-01-31: Submitted to Robotics: Science and Systems
References
- [1] Radford, Alec, et al. "Robust speech recognition via large-scale weak supervision." International conference on machine learning. PMLR, 2023.