VideoPose 2.0, released June 27, 2011.

Annotations are stored in MATLAB format in clips.mat. See view_clips.m for usage. The train/dev/test split we used in our paper, Parsing Human Motion with Stretchable Models, is included in the partitions struct in clips.mat.
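If you prefer to read the annotations from Python rather than MATLAB, scipy.io can load clips.mat. The field names below (clips, examples, imgfile, partitions) come from this README, but the exact nesting inside the real file may differ, so treat this as a template; the snippet round-trips a tiny stand-in file so the access pattern can be shown without the dataset itself.

```python
# Sketch: reading clips.mat-style annotations from Python via scipy.io.
# The structure below is a toy stand-in mimicking the fields this README
# describes -- verify the real layout against view_clips.m.
from scipy.io import loadmat, savemat

toy = {
    "clips": {"examples": {"imgfile": "images/clip01/0001.png"}},
    "partitions": {"train": [1, 2], "test": [3]},
}
savemat("toy_clips.mat", toy)

# struct_as_record=False exposes MATLAB structs as objects with attributes;
# squeeze_me=True drops the singleton dimensions MATLAB adds everywhere.
data = loadmat("toy_clips.mat", struct_as_record=False, squeeze_me=True)
first_example = data["clips"].examples
print(first_example.imgfile)        # -> images/clip01/0001.png
print(data["partitions"].train)     # -> the training clip indices
```

With the real file, clips would be a struct array, so the MATLAB expression clips(i).examples(j).imgfile translates to data["clips"][i].examples[j].imgfile (0-based in Python).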

Alternate versions

If for some reason you want the original, uncropped video frames (e.g., for better background subtraction), or the cropped images at every frame (we only used every other frame in the above), you can access them from the link above.

You can use clips(i).examples(j).imgfile to index the files as before. The field clips(i).examples(j).cropbox gives the box used to crop the original frames down to the images in the VideoPose2/images directory.
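The cropping itself is straightforward to reproduce. Below is a hedged sketch of applying a cropbox to a full frame; the [x1 y1 x2 y2] layout and MATLAB-style 1-based inclusive indexing are assumptions on my part, so check view_clips.m for the actual convention before relying on it.

```python
# Sketch: applying a cropbox to an uncropped frame.
# Assumption: cropbox is [x1, y1, x2, y2], 1-based and inclusive,
# as MATLAB indexing would suggest -- verify against view_clips.m.
import numpy as np

def crop_frame(frame, cropbox):
    """Crop an H x W x 3 frame with a 1-based inclusive [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = (int(v) for v in cropbox)
    # Convert 1-based inclusive indices to Python half-open slices.
    return frame[y1 - 1:y2, x1 - 1:x2]

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a full frame
cropped = crop_frame(frame, [101, 51, 400, 350])
print(cropped.shape)  # -> (300, 300, 3)
```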


The dataset consists of 44 short clips, 2-3 seconds in length, with a total of 1,286 frames. We use 26 clips for training, recycle 1 training clip for a development set, and use 18 for testing. The dataset fixes global scale and translation of the person, as is typically assumed in order to avoid confounding detection errors with pose estimation errors.

We developed this dataset for the challenging task of tracking upper and lower arms, in conjunction with our CVPR 2011 paper, Parsing Human Motion with Stretchable Models. It consists of video clips taken from the TV shows Friends and Lost. We chose to focus on the arms because the remaining upper body parts (head, torso) can be localized with near-perfect accuracy by current methods given a detection window, whereas lower arm localization performance of state-of-the-art methods is still quite poor. Furthermore, most things we are interested in knowing about humans involve the hands—e.g., action recognition, gesture identification and object manipulation.

Clips in the dataset were hand-selected (before developing our system) to highlight natural settings where state-of-the-art methods fail:

  • A highly varied (yet realistic) range of poses.
  • Rapid gesticulation.
  • A significant portion of frames (30%) with foreshortened lower arms.