Published April 25, 2025 | Version 1.0.0
Dataset Open

REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly

Description

📋 Introduction

Robotic manipulation remains a core challenge in robotics, particularly for contact-rich tasks such as industrial assembly and disassembly. Existing datasets have significantly advanced learning in manipulation but are primarily focused on simpler tasks like object rearrangement, falling short of capturing the complexity and physical dynamics involved in assembly and disassembly. To bridge this gap, we present REASSEMBLE (Robotic assEmbly disASSEMBLy datasEt), a new dataset designed specifically for contact-rich manipulation tasks. Built around the NIST Assembly Task Board 1 benchmark, REASSEMBLE includes four actions (pick, insert, remove, and place) involving 17 objects. The dataset contains 4,551 demonstrations, of which 4,035 were successful, spanning a total of 781 minutes. Our dataset features multi-modal sensor data including event cameras, force-torque sensors, microphones, and multi-view RGB cameras. This diverse dataset supports research in areas such as learning contact-rich manipulation, task condition identification, action segmentation, and more. We believe REASSEMBLE will be a valuable resource for advancing robotic manipulation in complex, real-world scenarios.

✨ Key Features

  • Multimodality: REASSEMBLE contains data from robot proprioception, RGB cameras, force-torque sensors, microphones, and event cameras.
  • Multitask labels: REASSEMBLE contains labels that enable research in Temporal Action Segmentation, Motion Policy Learning, Anomaly Detection, and Task Inversion.
  • Long horizon: Demonstrations in the REASSEMBLE dataset cover long-horizon tasks, with actions that usually span multiple steps.
  • Hierarchical labels: REASSEMBLE contains action segmentation labels at two hierarchical levels.

🔴 Dataset Collection

Each demonstration starts by randomizing the board and object poses, after which an operator teleoperates the robot to assemble and disassemble the board while narrating their actions and marking task segment boundaries with key presses. The narrated descriptions are transcribed using Whisper [1]. The board and camera poses are measured at the beginning of each demonstration with a motion capture system; continuous tracking is avoided because it interferes with the event camera. Sensory data is recorded with rosbag and later post-processed into HDF5 files without downsampling or synchronization, preserving raw data and timestamps for future flexibility. To reduce memory usage, video and audio are stored as encoded MP4 and MP3 files, respectively. Transcription errors are corrected automatically or manually, and a custom visualization tool is used to validate the synchronization and correctness of all data and annotations. Missing or incorrect entries are identified and corrected, ensuring the dataset's completeness. Low-level skill annotations were added manually after data collection, and all labels were carefully reviewed to ensure accuracy.

📑 Dataset Structure

The dataset consists of several HDF5 (.h5) and JSON (.json) files, organized into two directories. The poses directory contains the JSON files, which store the poses of the cameras and the board in the world coordinate frame. The data directory contains the HDF5 files, which store the sensory readings and annotations collected as part of the REASSEMBLE dataset. Each JSON file can be matched with its corresponding HDF5 file based on their filenames, which include the timestamp when the data was recorded. For example, 2025-01-09-13-59-54_poses.json corresponds to 2025-01-09-13-59-54.h5.

The structure of the JSON files is as follows:

{"Hama1": [
        [x ,y, z],
        [qx, qy, qz, qw]
 ], 
 "Hama2": [
        [x ,y, z],
        [qx, qy, qz, qw]
 ], 
 "DAVIS346": [
        [x ,y, z],
        [qx, qy, qz, qw]
 ], 
 "NIST_Board1": [
        [x ,y, z],
        [qx, qy, qz, qw]
 ]
}

[x, y, z] represent the position of the object, and [qx, qy, qz, qw] represent its orientation as a quaternion.
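
The JSON files can be read with Python's standard json module. Below is a minimal sketch, assuming the poses and data directories described above; the recording name is just one example from the dataset.

import json
import os

# Example recording; any *_poses.json file from the poses directory works.
poses_path = "poses/2025-01-09-13-59-54_poses.json"

with open(poses_path) as f:
    poses = json.load(f)

# Each entry is [[x, y, z], [qx, qy, qz, qw]] in the world coordinate frame.
board_position, board_quaternion = poses["NIST_Board1"]
print("Board position:", board_position)
print("Board orientation (qx, qy, qz, qw):", board_quaternion)

# The matching sensory recording shares the timestamp in its file name.
h5_path = os.path.join("data", os.path.basename(poses_path).replace("_poses.json", ".h5"))
print("Corresponding recording:", h5_path)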

The HDF5 (.h5) format organizes data into two main types of structures: datasets, which hold the actual data, and groups, which act like folders that can contain datasets or other groups. In the diagram below, groups are shown as folder icons, and datasets as file icons. The main group of the file directly contains the video, audio, and event data. To save memory, video and audio are stored as encoded byte strings, while event data is stored as arrays. The robot’s proprioceptive information is kept in the robot_state group as arrays. Because different sensors record data at different rates, the arrays vary in length (signified by the N_xxx variable in the data shapes). To align the sensory data, each sensor’s timestamps are stored separately in the timestamps group. Information about action segments is stored in the segments_info group. Each segment is saved as a subgroup, named according to its order in the demonstration, and includes a start timestamp, end timestamp, a success indicator, and a natural language description of the action. Within each segment, low-level skills are organized under a low_level subgroup, following the same structure as the high-level annotations.

📁 <date_time>.h5
├──📄 hama1 - mp4 encoded video
├──📄 hama1_audio - mp3 encoded audio
├──📄 hama2 - mp4 encoded video
├──📄 hama2_audio - mp3 encoded audio
├──📄 hand - mp4 encoded video
├──📄 hand_audio - mp3 encoded audio
├──📄 capture_node - mp4 encoded video (event camera)
├──📄 events - N_events x 3 (x, y, polarity)
├──📁 robot_state
│   ├──📄 compensated_base_force - N_bf x 3 (x, y, z)
│   ├──📄 compenseted_base_torque - N_bt x 3 (x, y, z)
│   ├──📄 gripper_positions - N_grip x 2 (left, right)
│   ├──📄 joint_efforts - N_je x 7 (one for each joint)
│   ├──📄 joint_positions - N_jp x 7 (one for each joint)
│   ├──📄 joint_velocities - N_jv x 7 (one for each joint)
│   ├──📄 measured_force - N_mf x 3 (x, y, z)
│   ├──📄 measured_torque - N_mt x 3 (x, y, z)
│   ├──📄 pose - N_poses x 7 (x, y, z, qw, qx, qy, qz)
│   └──📄 velocity - N_vels x 6 (x, y, z, ω, γ, θ)
├──📁 timestamps
│   ├──📄 hama1 - N_hama1 x 1
│   ├──📄 hama2 - N_hama2 x 1
│   ├──📄 hand - N_hand x 1
│   ├──📄 capture_node - N_capture x 1
│   ├──📄 events - N_events x 1
│   ├──📄 compensated_base_force - N_bf x 1
│   ├──📄 compenseted_base_torque - N_bt x 1
│   ├──📄 gripper_positions - N_grip x 1
│   ├──📄 joint_efforts - N_je x 1
│   ├──📄 joint_positions - N_jp x 1
│   ├──📄 joint_velocities - N_jv x 1
│   ├──📄 measured_force - N_mf x 1
│   ├──📄 measured_torque - N_mt x 1
│   ├──📄 pose - N_poses x 1
│   └──📄 velocity - N_vels x 1
└──📁 segments_info
    ├──📁 0
    │   ├──📄 start - scalar
    │   ├──📄 end - scalar
    │   ├──📄 success - Boolean
    │   ├──📄 text - scalar
    │   └──📁 Low_level
    │       ├──📁 0
    │       │   ├──📄 start - scalar
    │       │   ├──📄 end - scalar
    │       │   ├──📄 success - Boolean
    │       │   └──📄 text - scalar
    │       └──📁 1
    │           ⋮
    └──📁 1
        ⋮
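
The HDF5 files can be opened directly with h5py (listed under Requires below). The following is a minimal sketch, assuming the group and dataset names shown in the tree above; it reads the force-torque stream and iterates over the high-level action segments.

import h5py
import numpy as np

# Minimal sketch for reading one recording with h5py (https://www.h5py.org/).
# The file name is an example; substitute any .h5 file from the data directory.
with h5py.File("data/2025-01-09-13-59-54.h5", "r") as f:
    # Proprioceptive arrays and their per-sensor timestamps (lengths differ per sensor).
    force = f["robot_state/measured_force"][:]              # N_mf x 3
    force_t = f["timestamps/measured_force"][:].squeeze()   # N_mf

    # Iterate over the high-level action segments in recording order.
    for name in sorted(f["segments_info"], key=int):
        seg = f["segments_info"][name]
        start, end = seg["start"][()], seg["end"][()]
        success = bool(seg["success"][()])
        text = seg["text"][()]
        text = text.decode() if isinstance(text, bytes) else text

        # Align the force readings to the segment via its timestamps.
        in_segment = (force_t >= start) & (force_t <= end)
        print(f"{name}: '{text}' success={success}, {int(np.sum(in_segment))} force samples")

Because each sensor is stored at its own rate, the per-sensor timestamps are the intended way to align modalities or to crop them to a segment.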

The splits folder contains two text files that list the .h5 files used for the training and validation splits.
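
Because the video and audio streams are stored as encoded byte strings rather than decoded frames, they must be decoded before use. The sketch below shows one possible approach using OpenCV via a temporary file; this is an illustration only, and the loaders in the GitHub repository may decode the streams differently.

import tempfile

import cv2
import h5py

# Sketch: decode the MP4-encoded byte string of one camera stream (assumes OpenCV with
# FFmpeg support is installed). The exact dtype may vary; bytes() handles both a raw
# byte string and a 1-D uint8 array.
with h5py.File("data/2025-01-09-13-59-54.h5", "r") as f:
    video_bytes = bytes(f["hama1"][()])  # raw MP4 container as a byte string

with tempfile.NamedTemporaryFile(suffix=".mp4") as tmp:
    tmp.write(video_bytes)
    tmp.flush()
    cap = cv2.VideoCapture(tmp.name)
    ok, frame = cap.read()  # first frame as an H x W x 3 array (BGR order in OpenCV)
    cap.release()

print("decoded first frame:", ok, None if frame is None else frame.shape)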

📌 Important Resources

The project website contains more details about the REASSEMBLE dataset. The code for loading and visualizing the data is available in our GitHub repository.

📄 Project website: https://tuwien-asl.github.io/REASSEMBLE_page/
💻 Code: https://github.com/TUWIEN-ASL/REASSEMBLE

⚠️ File comments

Below is a table listing the recordings with known issues. Issues typically correspond to missing data from one of the sensors.
Recording               Issue
2025-01-10-15-28-50.h5  hand cam missing at beginning
2025-01-10-16-17-40.h5  missing hand cam
2025-01-10-17-10-38.h5  hand cam missing at beginning
2025-01-10-17-54-09.h5  no empty action at beginning
2025-01-11-14-22-09.h5  no empty action at beginning
2025-01-11-14-45-48.h5  F/T not valid for last action
2025-01-11-15-27-19.h5  F/T not valid for last action
2025-01-11-15-35-08.h5  F/T not valid for last action
2025-01-13-11-16-17.h5  gripper broke for last action
2025-01-13-11-18-57.h5  pose not available for last action

 

Files (54.8 GiB)

Name        Size      MD5
data.zip    54.8 GiB  812103a652ca9201e87a3bcecfee4ef3
            87.3 KiB  2f3f86b65dc6312b504072a2460314c2
            7.9 KiB   2e1394a2fa65e4ebb6c1fd64136cb0a0
            1.1 KiB   641882928ef3a8b2c3db41ac7c60b994

Additional details

Additional titles

Alternative title (English)
Demonstrating REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly

Related works

Cites
Conference Paper: 10.15607/RSS.2024.XX.120 (DOI)
Conference Paper: 10.1109/ICRA57147.2024.10611615 (DOI)
Conference Paper: 10.1177/02783649241304789 (DOI)
Journal Article: 10.1109/LRA.2024.3520916 (DOI)
Requires
Software: https://www.h5py.org/ (Other)

Funding

INteractive robots that intuitiVely lEarn to inVErt tasks by ReaSoning about their Execution (INVERSE) 101136067
European Union
Robot Industry Core Technology Development Program 00416440
Ministry of Trade, Industry and Energy (MOTIE)

Dates

Accepted
2025-04-25
Accepted at Robotics: Science and Systems
Collected
2025-01-09/2025-01-14
Date collection period
Submitted
2025-01-31
Submitted to Robotics: Science and Systems

References

  • [1] Radford, Alec, et al. "Robust speech recognition via large-scale weak supervision." International conference on machine learning. PMLR, 2023.