Summary of the ActionSense Dataset Contents

A dataset informational card, based on the framework proposed by Gebru et al., "Datasheets for datasets," CoRR, 2018. See also this Latex template.

The ActionSense dataset contains high-resolution synchronized data streams from a suite of wearable sensors, environment-mounted sensors, and ground truth labels.

The current dataset contains 10* subjects, and is actively growing with a target of containing approximately 25 subjects by the Fall of 2022.

It currently spans approximately 778.0 minutes of recorded data, averaging 77.8±16.4 minutes per subject. Approximately 543.5 minutes of that time is occupied by performing kitchen activities (55.6±13.7 minutes per subject), while the remainder is occupied by calibration routines.

The dataset provides synchronized labels as ground truth data, spanning 20 unique activities. Of the time spent performing activities, 64.9% of the data has ground-truth labels entered in real time during the experiment. This leaves approximately 19.5 minutes of unlabeled data per subject, or 0.98 minutes per activity per subject; this generally represents the time spent providing instructions between activities.

*In the current dataset, data for one of the subjects only contains camera data due to an anomalous technical issue with the wearable streaming; this will be remedied shortly and updated on the repository, although it can still be useful for computer vision in the meantime. The remainder of the dataset statistics focus on the 9 subjects with complete data.

*Due to an issue with the Manus gloves, finger-tracking data is not reliable for the initial subjects.  This will be remedied shortly. 

Sensor Streams

Wearable Sensors

Body and Finger Tracking

The Xsens MTw Awinda system estimates body pose and position. It uses 17 wireless IMU modules worn on the body with elastic straps, a tight-fitting jacket, and a headband. Manus Prime II Xsens gloves augment the skeleton with hand poses by using embedded bend sensors.

The Xsens and Manus applications process sensor data in real time to estimate positions and orientations of each body and finger segment. The continuous ActionSense recordings include the third-party calibration periods comprising known poses and motions, as well as custom calibration periods featuring known global positions. These are performed at both the start and end of each experiment to help estimate drift.

During experiments, ActionSense records data that is streamed from the Xsens software via a network socket. In addition, the Xsens software can record all data in a proprietary format and then reprocess it after the experiment; ActionSense includes scripts to import these improved estimates and synchronize them with the rest of the sensor suite. The original data is also provided separately.

Limitations: These sensors provide wearable body and finger tracking, facilitating freedom of motion and larger workspaces. However, accuracy may be reduced when compared to external motion-tracking infrastructure. For example, the global positions inferred from the IMUs are observed to drift over time. Finally, the Manus gloves include IMUs that can be fused with the bend sensors, but current experiments disable them due to observed accuracy issues; this reduces the measured degrees of freedom to only capturing finger curling and not lateral finger spreading.

Eye Tracking

The Pupil Core headset by Pupil Labs features a wide-angle RGB camera to view the world, and an infrared camera aimed at the pupil. The Pupil Capture software detects the pupil orientation in real time and maps the estimated gaze into world-view image coordinates to estimate where the person is looking. The system is calibrated by requesting the user to gaze at a displayed target while moving their head to vary their eye orientation. ActionSense records data from all sensors during this process, to enable re-calibration or accuracy evaluation in post-processing.

The Pupil Capture software streams data to the ActionSense framework via network sockets during experiments, and can also save all data to dedicated files. ActionSense includes scripts to merge its recorded data into the main dataset after an experiment, to provide the most consistent sampling rates possible. The original streamed data and the raw recorded data are also available.

Limitations: The small wearable sensor provides valuable attention estimates, but it requires a USB cable; ActionSense uses a stretchable coiled overhead cable to reduce motion hindrance, and a wearable computer could eliminate the tether. In addition, current experiments use a single eye camera; leveraging binocular information may improve gaze accuracy. Finally, the wide-angle camera is adjusted to view as much of the task as possible, but does not fully span the subject's field of view.

Muscle Activity

A Myo Gesture Control Armband from Thalmic Labs is worn on each forearm. It contains 8 differential pairs of dry EMG electrodes to detect muscle activity, an accelerometer, a gyroscope, and a magnetometer. It also fuses the IMU data to estimate forearm orientation, and classifies a set of five built-in gestures. ActionSense wirelessly streams all data by leveraging a Python API. The device normalizes and detrends muscle activity without dedicated calibration, but forearm orientation estimates are relative to an arbitrary starting pose. To facilitate transforming these into global or task reference frames, the calibration poses include known arm orientations.

Limitations: The current sensor suite only includes muscle activity from the forearms, which are highly useful for the chosen manipulation tasks but which may not capture all relevant forces and stiffnesses. In addition, the estimated orientations may drift over time.

Tactile Sensors

Custom sensors on each glove provide tactile information. Conductive threads are taped to a pressure-sensitive material that decreases its electrical resistance when force is applied. Threads are oriented perpendicularly to each other on opposite sides of the material, such that each intersection point acts as a pressure sensor. Measuring resistance between each pair of threads yields a matrix of tactile readings. This technique has been explored previously for applications such as smart carpets.

The current implementation features 32 threads in each direction that are routed to form a 23x19 grid on the palm, a 5x9 grid on the thumb, an 11x4 grid on the little finger, and a 13x4 grid on each other finger. The readings are performed by a microcontroller and a custom PCB for processing and multiplexing. These are worn on the hand or arm, and data is streamed either via USB or wirelessly.

ActionSense provides two types of calibration to facilitate force or pose estimation pipelines. First, the subject presses on a Dymo M25 Postal Scale using a flat hand or individual fingers while weight readings are recorded from the scale via USB. The person then holds 5 unique objects. All sensors are recorded during these activities, including the depth camera and finger joint data.

Limitations:These flexible sensors provide high-resolution tactile information, but their sampling rate and material response time may preclude highly dynamic tasks. Calibration periods are designed to help interpret the readings, but accuracy when converting to physical units may vary over time. Finally, the tactile sensors do not reach the fingertips since the underlying gloves feature open fingers.

Ground Truth and Environment-Mounted Sensors

Interactive Activity Labeling

The ActionSense software includes a GUI that enables real-time labeling while accommodating unexpected issues or events. It allows researchers to indicate when calibrations or activities start and stop, and to mark each one as good or bad with explanatory notes. Entries are timestamped and saved using the common data format, so pipelines can extract labeled segments from any subset of sensors.


RGB Cameras: Five FLIR GS2-GE-20S4C-C cameras provide ground-truth vision data of human activities. Four cameras surround the kitchen perimeter, while one camera is mounted above the main table. Images are captured at 22 Hz with 1600x1200 resolution. A custom ROS framework communicates with all cameras. Each camera is calibrated using a checkerboard. The calibration result, raw-format frame images with timestamps, and generated videos are all included in the dataset.

Depth Camera: Since most of the chosen activities interact with objects on the table, detailed 3D information of this area can be helpful. We mount an Intel RealSense D415 Depth Camera in front of the kitchen island. We stream this depth camera at 15 Hz with 640x480 resolution. An Intel RealSense driver is used to communicate with the camera. Raw images and the depth pointcloud are recorded, and the extrinsic calibration information is also provided.

Limitations: These cameras provide useful ground truth data about the activities in the kitchen environment, but there may be some areas that cannot be viewed while a subject is doing certain tasks (for example, inside the dishwasher). In such cases, the first-person camera may still provide vision data. In the future, we plan to do multi-camera calibration for 3D reconstruction of the environment.


Two microphones provide audio information that can be used by learning, segmentation, or auto-labeling pipelines. One has an omnidirectional pickup pattern and is secured above the main table, while the other has a cardioid pickup pattern and is placed on the counter behind the sink. Since the environment is within a common lab space, background noise and speech may be included in the data; this could be a limitation or a benefit depending on the goals of future analysis pipelines.

Activity Stats

The below table summarizes the current counts and durations of labeled activities across 5 subjects. Note that some labels can be further segmented into sub-tasks in the future via manual post-processing or auto-labeling pipelines based on subsets of the sensors; for example, the dataset currently labels slicing a cucumber as a single activity, but individual slices could also be labeled either manually or via sensors such as audio.

File Structures

Wearable Data

Data from all wearable sensors is organized hierarchically into a single HDF5 file. This is a cross-platform file format that can be used with multiple programming languages including Python, Matlab, Java, and C++. Data is organized according to devices and then streams. Each stream contains at least 3 entries: the data, timestamps for each row of the data as seconds since epoch, and timestamps for each row of data as a date-time string. Streams may also choose to include extra entries, such as timestamps generated internally to the sensor that can be used during post-processing.

The below table summarizes the data streams and organization for the wearable sensors and labels. It includes the typical sampling rates achieved, the data size where N is the number of timesteps, and the metadata that is embedded within the HDF5 file. Note that this file format is also easily modifiable, making it amenable to such tasks as fixing or refining labels and adding new sensors.

Video and Audio Data

The data from every camera contains raw sequential images, a text file with timestamps, and a generated video. The five FLIR GS2-GE-20S4C-C cameras are named as C1 through C5. Each image is named as frameXXXXXX in which XXXXXX is the image index. The data from the Intel RealSense D415 depth camera contains raw sequential depth data and raw sequential images. The depth data is following the PointCloud message type defined in ROS sensor_msgs package, and the data for each frame is stored in a text file named as frameXXXXXX in which XXXXXX is the index.

The overhead microphone M1 is omnidirectional, while the sink-level microphone M2 has a cardioid pickup pattern. Audio data is streamed at 48 kHz as chunks of 2048 samples, with 2 bytes per sample. The dataset includes audio data from each microphone as an uncompressed WAV file. Timestamps for each chunk of audio data is included in an HDF5 file, which also contains metadata about the data format and interpretation.