Berkeley MHAD

Berkeley Multimodal Human Action Database (MHAD)

About the Project

The presented dataset was generated as part of the NSF-funded project (#0941382), CDI-Type I: Collaborative Research: A Bio-Inspired Approach to Recognition of Human Movements and Movement Styles. The project is a collaborative effort between the Teleimmersion Lab at the University of California, Berkeley and the Center for Imaging Science at Johns Hopkins University. Its objective was to develop bio-inspired algorithms for recognizing human movements and movement styles in various human activities.


The Berkeley Multimodal Human Action Database (MHAD) contains 11 actions performed by 7 male and 5 female subjects aged 23-30 years, except for one elderly subject. All the subjects performed 5 repetitions of each action, yielding about 660 action sequences, which correspond to about 82 minutes of total recording time. In addition, we recorded a T-pose for each subject, which can be used for skeleton extraction, and background data (with and without the chair used in some of the activities). Figure 1 shows snapshots of all the actions taken by the front-facing camera and the corresponding point clouds extracted from the Kinect data. The set of actions comprises the following: (1) actions with movement in both upper and lower extremities, e.g., jumping in place, jumping jacks, throwing, etc., (2) actions with high dynamics in upper extremities, e.g., waving hands, clapping hands, etc., and (3) actions with high dynamics in lower extremities, e.g., sit down, stand up. Prior to each recording, the subjects were given instructions on what action to perform; however, no specific details were given on how the action should be executed (i.e., performance style or speed). The subjects thus incorporated different styles in performing some of the actions (e.g., punching, throwing). Figure 2 shows a snapshot of the throwing action from the reference camera of each camera cluster and from the two Kinect cameras. The figure demonstrates the amount of information that can be obtained from multi-view and depth observations as compared to a single viewpoint.

Figure 1. Snapshots from all the actions available in the Berkeley MHAD are displayed together with the corresponding point clouds obtained from the Kinect depth data. Actions (from left to right): jumping, jumping jacks, bending, punching, waving two hands, waving one hand, clapping, throwing, sit down/stand up, sit down, stand up.

Figure 2. The throwing action is displayed from the reference camera in each camera cluster as well as from the two Kinect cameras.

Demo Videos Playlist


In this section we describe the components of the multimodal acquisition system used for the collection of Berkeley MHAD. Each action was simultaneously captured by five different systems: an optical motion capture system, four multi-view stereo vision camera arrays, two Microsoft Kinect cameras, six wireless accelerometers, and four microphones. Figure 3 shows the layout of the sensors used in our setup.

Figure 3. Diagram of the data acquisition system.

Motion Capture System

For the ground truth we used the Impulse optical motion capture system (PhaseSpace Inc., San Leandro, CA), which captured the 3D positions of active LED markers at a frequency of 480 Hz. The system uses linear-detector-based cameras with a resolution of 3600x3600 pixels, providing sub-millimeter accuracy and unique identification of up to 48 markers. We used 8 motion capture cameras arranged in a circular configuration about 3 m off the ground, with a capture space of about 2 m x 2 m. The motion capture system was calibrated using the manufacturer's software. The motion capture data acquisition server also provided the time synchronization for the other sensors (i.e., cameras, Kinect, accelerometers) through NTP, using the Meinberg NTP client service, which offers high accuracy in the synchronization of the system time. To capture the position of different parts of the body via the mocap system, we used a custom-built tight-fitting suit with 43 LED markers. We post-processed the 3D marker trajectories to extract the skeleton of each subject using MotionBuilder software (Autodesk Inc., San Rafael, CA, USA).

Camera System

Multi-view video data was captured by 12 Dragonfly2 cameras (Point Grey Research Inc., Richmond, BC, Canada) with an image resolution of 640x480 pixels. The cameras were arranged into four clusters: two clusters for stereo and two clusters with four cameras for multi-view capture, deployed as shown in Figure 3. We used varifocal lenses with a focal range of 3.5 mm to 6 mm. The focal length was set at approximately 4 mm to provide a full view of the subject while performing various actions. The images were captured in raw data format (i.e., "GRBG" Bayer image format) with a frame rate of about 22 Hz, using hardware triggering for temporal synchronization across all views. Prior to data collection, the cameras were also geometrically calibrated and aligned with the mocap system.
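As a rough illustration of how the raw Bayer frames can be turned into color images, the sketch below collapses each 2x2 Bayer cell into one RGB pixel at half resolution. It assumes the GRBG layout listed in the file-format table further down; the function name `demosaic_grbg_half` and the sample values are hypothetical, and a real pipeline would more likely use an interpolating demosaicing routine from an image library.

```python
def demosaic_grbg_half(raw):
    """Convert a GRBG Bayer mosaic (list of rows) to a half-resolution RGB image.

    Each 2x2 cell is assumed to be laid out as
        G R
        B G
    and becomes one (R, G, B) pixel; the two green samples are averaged.
    """
    height, width = len(raw), len(raw[0])
    if height % 2 or width % 2:
        raise ValueError("Bayer mosaic must have even dimensions")
    rgb = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            g1 = raw[y][x]
            r = raw[y][x + 1]
            b = raw[y + 1][x]
            g2 = raw[y + 1][x + 1]
            row.append((r, (g1 + g2) // 2, b))
        rgb.append(row)
    return rgb


# Hypothetical 2x4 mosaic (two Bayer cells side by side).
mosaic = [
    [10, 200, 12, 210],
    [50, 14, 60, 16],
]
print(demosaic_grbg_half(mosaic))  # [[(200, 12, 50), (210, 14, 60)]]
```

Binning trades resolution for simplicity: a 640x480 raw frame becomes a 320x240 color image, which is often enough for a quick visual check.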

Kinect System

For the depth data acquisition we positioned two Microsoft Kinect cameras facing in approximately opposite directions to prevent interference between the two active pattern projections. Each Kinect camera captured a color image with a resolution of 640x480 pixels and a 16-bit depth image, both at an acquisition rate of 30 Hz. Although the color and depth acquisition inside the Kinect are not precisely synchronized, the temporal difference is not noticeable in the output due to the relatively high frame rate. The Kinect cameras were temporally synchronized with the NTP server and geometrically calibrated with the mocap system. We used the OpenNI driver for the acquisition of the images since the official Microsoft Kinect driver was not yet available at the time of the data collection.
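The point clouds mentioned earlier can be obtained by back-projecting each depth pixel through a pinhole camera model. The following is a minimal sketch under stated assumptions: raw depth values are taken to be in millimetres, zero is taken to mean "no measurement", and the intrinsics `fx`, `fy`, `cx`, `cy` are placeholders rather than the calibration values shipped with the dataset.

```python
def depth_to_points(depth, fx, fy, cx, cy, depth_scale=0.001):
    """Back-project a depth map (rows of raw integer values) into 3D points.

    depth_scale converts raw units to metres (assumption: raw values in mm).
    Pixels with a raw value of 0 are skipped as invalid (assumption).
    Returns a list of (x, y, z) tuples in the camera coordinate frame.
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d == 0:
                continue
            z = d * depth_scale
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Hypothetical 2x2 depth map and intrinsics, for illustration only.
pts = depth_to_points([[0, 2000], [1000, 0]], fx=500.0, fy=500.0, cx=1.0, cy=1.0)
print(pts)
```

Since the depth images were registered to the color camera frame during acquisition, the color camera intrinsics would be the natural choice for `fx`, `fy`, `cx`, `cy`.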


To capture the dynamics of the movement, we used six three-axis wireless accelerometers (Shimmer Research, Dublin, Ireland). The accelerometers were strapped onto or inserted into the mocap suit to measure movement at the wrists, ankles and hips. The accelerometers captured the motion data at a frequency of about 30 Hz and delivered the data via the Bluetooth protocol to the acquisition computer, where time stamps were applied to each collected frame. Due to delays in the wireless communication, there was a lag of about 100 ms between the acquisition and recording times, which was compensated for by time-stamp adjustments.
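The time-stamp adjustment described above amounts to shifting each recorded stamp back by the wireless lag. The sketch below illustrates the idea only; it assumes a constant lag of 100 ms and a hypothetical `(unix_time_s, ax, ay, az)` sample layout, and it is not needed for the released files if their stamps already include the correction.

```python
WIRELESS_LAG_S = 0.100  # about 100 ms, as stated in the text (assumed constant)


def compensate_lag(samples, lag_s=WIRELESS_LAG_S):
    """Shift recording-time stamps back to approximate acquisition times.

    samples: list of (unix_time_s, ax, ay, az) tuples (hypothetical layout).
    """
    return [(t - lag_s, ax, ay, az) for (t, ax, ay, az) in samples]


shifted = compensate_lag([(1000.5, 0.1, 0.2, 9.8)])
print(shifted)
```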

Audio System

The importance of audio in video analysis has been shown in past research, where human action detection performance was improved by integrating audio with visual information. We therefore also recorded audio during the performance of each action using four microphones arranged around the capture space as shown in Figure 3. Three of the microphones were placed on tripods about 80 cm off the ground, while the fourth microphone was taped to the floor to capture the sounds generated at the ground surface. The audio was recorded at a sampling frequency of 48 kHz through an analog-to-digital converter/amplifier, which was connected via an optical link to a digital recorder.


System | Number | Frequency | Resolution | Time Synchronization
PhaseSpace Motion Capture | 8 | 480 Hz | <1 mm | Native
Stereo Cameras (2 Dragonfly2 cameras) | 2 | 22 Hz | RAW 640x480 | Hard trigger between cameras; time stamps with mocap
Quad Cameras (4 Dragonfly2 cameras) | 2 | 22 Hz | RAW 640x480 | Hard trigger between cameras; time stamps with mocap
Microsoft Kinect | 2 | 30 Hz | RGB 640x480; Depth 0.03 - 8.4 cm | Time stamps with mocap
Accelerometers | 6 | ~30 Hz | 1/216 | Time stamps with mocap
Microphones | 4 | 48 kHz | N/A | Event synchronization with clapping

Geometrical Calibration

The purpose of the geometrical calibration was to internally calibrate the vision-based acquisition systems and to subsequently align the output data in a common coordinate system. We first performed the geometrical calibration of the motion capture system using the calibration tool provided by the manufacturer. The calibration procedure consists of waving a wand with several markers inside the capture space. The position of the cameras was calibrated with an accuracy of 1 mm. After determining the position of the mocap cameras, we placed a user-defined coordinate system at the ground level, approximately in the center of the capture space and aligned as shown in Figure 3. We marked this coordinate system on the floor to provide a reference location for the subject.

For the vision system, each camera was first adjusted to equalize the field of view among the cameras. The cameras inside each cluster were also aligned to minimize the vertical disparity. Each cluster was then calibrated by capturing 20-30 images of a checkerboard (15x10 squares, square size: 40.5 mm) to obtain the intrinsic camera parameters (i.e., focal length, optical center, lens distortion) and the relative position and orientation of the cameras in each cluster. The cameras were calibrated with a reprojection error of about 0.15-0.20 pixels. The Kinect color camera was calibrated using a similar method to obtain the intrinsic parameters. The Kinect depth camera was set to project depth images onto the associated color camera coordinate frame on the fly, eliminating the need for an explicit calibration.

To determine the extrinsic parameters (i.e., absolute position and orientation) of the camera clusters and the Kinect cameras with respect to the motion capture coordinate system, we captured the movement of a linear calibration object with two LED markers (approximately 700 mm apart). Once the LED data was collected, we calibrated the multi-stereo cameras using the algorithm in (Kurillo, ICDSC'2008), which determines the position and orientation of all the cameras in a cluster with respect to the reference camera in that cluster. Next, we calibrated the reference camera of each cluster and the two Kinect cameras to the motion capture coordinate system from the known 3D LED positions (provided by the motion capture) and their corresponding image projections.
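The relation between a 3D LED position and its image projection can be sketched with a distortion-free pinhole model: extrinsic calibration looks for the rotation and translation that make the projected LED positions agree with the observed pixels. The code below only evaluates that agreement (the RMS reprojection error) for a given pose; it does not solve for the pose, and the rotation `R`, translation `t`, and intrinsics used here are made-up values, not the dataset's calibration.

```python
import math


def project(point_world, R, t, fx, fy, cx, cy):
    """Project a 3D point (mocap frame) to pixels: x_cam = R * x_world + t."""
    xc = [sum(R[i][j] * point_world[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)


def rms_reprojection_error(points_world, pixels, R, t, fx, fy, cx, cy):
    """Root-mean-square pixel distance between projections and observations."""
    total = 0.0
    for p, (u, v) in zip(points_world, pixels):
        pu, pv = project(p, R, t, fx, fy, cx, cy)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return math.sqrt(total / len(points_world))


# Hypothetical pose: camera 3 m from the origin, axes aligned with the mocap frame.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = (0.0, 0.0, 3.0)
intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# Two LED markers 700 mm apart (coordinates in metres).
leds = [(-0.35, 0.0, 0.0), (0.35, 0.0, 0.0)]
pixels = [project(p, R, t, **intrinsics) for p in leds]
error = rms_reprojection_error(leds, pixels, R, t, **intrinsics)
print(pixels, error)
```

In practice the pose would be estimated by minimizing this error over many LED observations, typically with a perspective-n-point solver that also accounts for lens distortion.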

The locations of the microphones were measured with a measuring tape with respect to the motion capture world coordinate system determined in the first step.

Microphone | Position (X,Y,Z) in cm | Position (X,Y,Z) in inches
Microphone #1 (floor) | (41.91, 1.27, 165.1) | (16.5, 0.5, 65.0)
Microphone #2 | (162.56, 81.28, 2.54) | (64.0, 32.0, 1.0)
Microphone #3 | (31.75, 81.28, -157.48) | (12.5, 32.0, -62.0)
Microphone #4 | (-97.79, 81.28, 2.54) | (-38.5, 32.0, 1.0)
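The two position columns are related by the standard factor of 2.54 cm per inch. As a small consistency check, the sketch below converts the inch values from the table and prints the resulting centimetre coordinates; the dictionary name `MICS_IN` is just an illustrative choice.

```python
INCH_TO_CM = 2.54

# Microphone positions in inches, copied from the table above.
MICS_IN = {
    1: (16.5, 0.5, 65.0),   # floor microphone
    2: (64.0, 32.0, 1.0),
    3: (12.5, 32.0, -62.0),
    4: (-38.5, 32.0, 1.0),
}

# Convert every coordinate and round to the table's two decimal places.
mics_cm = {
    mic: tuple(round(c * INCH_TO_CM, 2) for c in pos)
    for mic, pos in MICS_IN.items()
}

for mic, pos in sorted(mics_cm.items()):
    print(mic, pos)
```

The second coordinate of the three tripod microphones comes out at 81.28 cm, matching the stated height of about 80 cm off the ground.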

Temporal Synchronization

The temporal synchronization of the motion capture system cameras was based on the data protocols provided by the Impulse system. The vision cameras were synchronized with a hardware-based trigger, providing synchronization accuracy at the microsecond level.

The motion capture server provided the NTP service for the synchronization of the other computer systems. On each computer we installed the Meinberg NTP client which provided an improvement in accuracy as compared to the standard Windows NTP client by correcting the time drift via acceleration or deceleration of the system clock. The temporal synchronization between the different sensors (cameras, Kinect, mocap) was provided via UNIX time stamps which are included in the recordings of each modality.
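Because every modality carries UNIX time stamps, frames from sensors running at different rates can be paired by nearest time stamp. The sketch below matches each camera frame to the closest mocap frame using a binary search; the time stamps are synthetic (a 480 Hz and a 22 Hz stream starting at zero), and the function names are hypothetical.

```python
from bisect import bisect_left


def nearest_index(sorted_times, t):
    """Return the index of the entry in sorted_times closest to t."""
    i = bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[i - 1], sorted_times[i]
    return i if (after - t) < (t - before) else i - 1


def match_frames(camera_times, mocap_times):
    """For each camera time stamp, find the index of the nearest mocap frame."""
    return [nearest_index(mocap_times, t) for t in camera_times]


# Synthetic one-second streams at the nominal rates of the two systems.
mocap_times = [k / 480.0 for k in range(480)]
camera_times = [k / 22.0 for k in range(22)]
pairs = match_frames(camera_times, mocap_times)
print(pairs[:5])
```

With a 480 Hz reference stream, the nearest mocap frame is at most about 1 ms away from any camera time stamp inside the recording.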

To improve the synchronization accuracy of the image sequences (either from a digital camera or from a Microsoft Kinect) with the mocap data in software, we manually labeled one of the visible markers on a moving part of the body for a number of frames (usually at least 20 frames). The resulting set of manually labeled 2D pixel locations was then used together with the corresponding camera calibration parameters to recover a least-squares estimate of the temporal offset for every image sequence with respect to the reference mocap data.
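One simple way to obtain such a least-squares offset is a grid search: for each candidate offset, compare the labeled pixel locations with the mocap marker trajectory (already projected into the image) at the shifted times, and keep the offset with the smallest sum of squared distances. The sketch below uses synthetic data and a grid search as a stand-in; the solver actually used for the dataset is not specified here, and the sign convention (offset added to the camera time stamps) is an assumption.

```python
from bisect import bisect_right


def interpolate(times, values, t):
    """Linearly interpolate 2D values at time t, clamping outside the range."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_right(times, t)
    t0, t1 = times[i - 1], times[i]
    w = (t - t0) / (t1 - t0)
    (u0, v0), (u1, v1) = values[i - 1], values[i]
    return (u0 + w * (u1 - u0), v0 + w * (v1 - v0))


def estimate_offset(mocap_times, mocap_pixels, labels, max_offset=0.1, step=0.001):
    """Grid-search the offset (s) minimizing the summed squared pixel error.

    labels: list of (camera_time, u, v) for the manually labeled marker.
    """
    best_offset, best_cost = 0.0, float("inf")
    n = int(round(max_offset / step))
    for k in range(-n, n + 1):
        offset = k * step
        cost = 0.0
        for t, u, v in labels:
            pu, pv = interpolate(mocap_times, mocap_pixels, t + offset)
            cost += (pu - u) ** 2 + (pv - v) ** 2
        if cost < best_cost:
            best_offset, best_cost = offset, cost
    return best_offset


def trajectory(t):
    """Synthetic projected marker path in pixels (illustrative only)."""
    return (300.0 * t, 100.0 * t * t)


mocap_times = [k / 480.0 for k in range(960)]
mocap_pixels = [trajectory(t) for t in mocap_times]
true_offset = 0.025
labels = [(0.5 + k / 22.0,) + trajectory(0.5 + k / 22.0 + true_offset) for k in range(20)]
print(estimate_offset(mocap_times, mocap_pixels, labels))
```

The estimate is only as fine as the grid step, and the labeled marker needs to be moving during the labeled frames, otherwise every offset explains the labels equally well.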


For each subject, we collected the movements shown in the table below. In addition, we collected background data with and without the chair, which can be used for background segmentation.

Recorded Actions

# | Action | Number of Repetitions per Recording | Number of Recordings | Approximate Length per Recording
1 | Jumping in place | 5 | 5 | 5 sec
2 | Jumping jacks | 5 | 5 | 7 sec
3 | Bending - hands up all the way down | 5 | 5 | 12 sec
4 | Punching (boxing) | 5 | 5 | 10 sec
5 | Waving - two hands | 5 | 5 | 7 sec
6 | Waving - one hand (right) | 5 | 5 | 7 sec
7 | Clapping hands | 5 | 5 | 5 sec
8 | Throwing a ball | 1 | 5 | 3 sec
9 | Sit down then stand up | 5 | 5 | 15 sec
10 | Sit down | 1 | 5 | 2 sec
11 | Stand up | 1 | 5 | 2 sec
12 | T-pose | 1 | 1 | 5 sec
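The totals quoted in the overview can be roughly reproduced from this table. The short calculation below multiplies the 11 actions by 5 recordings and 12 subjects, and sums the approximate per-recording lengths (the T-pose is left out). Since the listed lengths are approximate, the nominal duration only roughly agrees with the reported total of about 82 minutes.

```python
SUBJECTS = 12
RECORDINGS_PER_ACTION = 5

# Approximate length per recording in seconds, from the table (T-pose excluded).
ACTION_LENGTH_S = {
    "Jumping in place": 5,
    "Jumping jacks": 7,
    "Bending": 12,
    "Punching (boxing)": 10,
    "Waving - two hands": 7,
    "Waving - one hand (right)": 7,
    "Clapping hands": 5,
    "Throwing a ball": 3,
    "Sit down then stand up": 15,
    "Sit down": 2,
    "Stand up": 2,
}

sequences = len(ACTION_LENGTH_S) * RECORDINGS_PER_ACTION * SUBJECTS
nominal_minutes = sum(ACTION_LENGTH_S.values()) * RECORDINGS_PER_ACTION * SUBJECTS / 60

print(sequences)        # 660
print(nominal_minutes)  # 75.0
```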

This set of activities contains dynamic body movements in general. Some activities have dynamics in both upper and lower extremities, e.g., jumping in place, jumping jacks, throwing a ball, etc., whereas other activities have dynamics mainly in the upper extremities, e.g., waving hands, clapping hands, etc. More importantly, this set of activities allows us to collect naturalistic motion data, as the subjects were not specifically instructed how to perform each action. The spontaneous performance is critical for capturing different movement styles for the same action in the database.

Many of the movements share common sub-movements. For instance, the jumping action is common to both jumping in place and jumping jacks; the hand-waving action is common to waving one hand, waving two hands, and jumping jacks; and the sit down and stand up actions are both part of the sit down then stand up action.

Collected Data and File Formats

In the table below, we provide information on the type and the size of data collected per subject by different sensory components in the acquisition system.

System | Data Format | Collected Data | Approximate Size per Subject
PhaseSpace Motion Capture | ASCII | XYZ 3D position of 43 markers; UNIX time stamps; frame numbers | 0.6 GB
Multi-Stereo Cameras | PGM | Bayer format (GRBG); UNIX time stamps | 31 GB
Microsoft Kinect | PPM | RGB color images; 16-bit depth map; UNIX time stamps | 34 GB
Accelerometers (Shimmer) | ASCII | Acceleration in XYZ axes; UNIX time stamps | 4 MB
Microphones | WAV | Uncompressed audio at 48 kHz | 150 MB

Missing Data

The following data are missing and are unavailable for download:

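Sensor | Subject | Action | Rep | Notes
Accelerometer | 2 | 7 | 5 |
Accelerometer | 4 | 8 | 5 |
Microphone | 2 | 7 | 5 |
Microphone | 4 | 8 | 5 |
Motion Capture | 4 | 8 | 5 |
Camera | 4 | 8 | 5 | all clusters
Camera | 5 | 2 | 5 | clusters 3 and 4
Camera | 10 | 8 | 5 | clusters 3 and 4
Kinect | 4 | 8 | 5 |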
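When iterating over the dataset, it can help to encode this table as a lookup so that missing recordings are skipped rather than raising file errors. The sketch below is one possible encoding; the sensor keys, the function name `is_missing`, and the assumption that the four camera clusters are numbered 1 to 4 are illustrative choices, not part of the dataset's tooling.

```python
# (sensor, subject, action, repetition) entries from the table above.
MISSING = {
    ("accelerometer", 2, 7, 5),
    ("accelerometer", 4, 8, 5),
    ("microphone", 2, 7, 5),
    ("microphone", 4, 8, 5),
    ("mocap", 4, 8, 5),
    ("kinect", 4, 8, 5),
}

# Camera entries are tracked per cluster (assumed numbering: clusters 1-4).
MISSING_CAMERA_CLUSTERS = {
    (4, 8, 5): {1, 2, 3, 4},   # "all clusters"
    (5, 2, 5): {3, 4},
    (10, 8, 5): {3, 4},
}


def is_missing(sensor, subject, action, rep, cluster=None):
    """Return True if the recording is listed as missing.

    For sensor == "camera", pass a cluster number to query one cluster;
    with cluster=None the result is True if any cluster is missing.
    """
    if sensor == "camera":
        clusters = MISSING_CAMERA_CLUSTERS.get((subject, action, rep), set())
        return bool(clusters) if cluster is None else cluster in clusters
    return (sensor, subject, action, rep) in MISSING


print(is_missing("camera", 5, 2, 5, cluster=3))  # True
print(is_missing("camera", 5, 2, 5, cluster=1))  # False
```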


If you report results obtained from this dataset, we request that you cite the following paper:

F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal and R. Bajcsy. Berkeley MHAD: A Comprehensive Multimodal Human Action Database. In Proceedings of the IEEE Workshop on Applications of Computer Vision (WACV), 2013.


MHAD data can be downloaded in full from the public Google Drive. For convenience, the top-level folders of the individual data modalities were zipped into the 'Zipped top-levels' folder on Google Drive.


Additional Downloads


Example MATLAB code that reads, manipulates and displays 3D data from motion capture system and Kinect sensors projected into the same volume using camera calibration parameters.


Example MATLAB code that reads, manipulates and displays 3D marker positions and skeletal joint locations projected onto a camera image sequence using camera calibration parameters.


Please email mhad[at] if you have any questions or comments.


The database and accompanying code are released under the BSD-2 license:

Copyright (c) 2013, Regents of the University of California
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.