DukeMTMC Project


People involved:
Ergys Ristani, Francesco Solera, Roger S. Zou, Rita Cucchiara, Carlo Tomasi.


Technical Specifications


New! A faster version of the evaluation kit is now available.

To download all data use this script. Individual data set components can be downloaded below:

File                               Description
LICENSE.txt                        Terms for using the data.
ground_truth/trainvalRaw.mat       All ground truth bounding boxes for the trainval set, derived from the manual annotations
ground_truth/trainval.mat          Ground truth bounding boxes for the trainval set, clipped to the regions of interest
calibration/calibration.txt        Homographies for all cameras
calibration/camera_position.txt    Camera positions/orientations for all cameras
calibration/ROIs.txt               Regions of interest where tracking is evaluated
detections/                        Deformable Part Model [1] detections for each camera
frames/                            All frames are extracted here in .JPG format (after the videos are downloaded)
masks/                             Foreground masks for all frames in .PNG format
3D scene                           From Google Earth
camera topology                    Top view
tracker output                     Our tracking system result. [cam, ID, frame, left, top, width, height, worldX, worldY]
devkit                             DukeMTMC evaluation kit for motchallenge.net

Time Synchronization

Each camera has its own local time starting at frame 1. Below is the total number of frames for each camera:

NumFrames = {359580, 360720, 355380, 374850, 366390, 344400, 337680, 353220}

The master camera is Camera 5. The first frame of each camera is synchronized to the master camera’s local time as follows:

StartTimes = {5543, 3607, 27244, 31182, 1, 22402, 18968, 46766}

As an example, frame 1.jpg in Camera 1 is synchronized to frame 5543.jpg in Camera 5.

The multi-camera dataset goes from frame 49700 to frame 356648, about 1 hour and 25 minutes at 59.940059 fps.
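The synchronization rule above amounts to a fixed per-camera offset. The following is a minimal sketch of that conversion; the function names are illustrative and are not part of the dataset tools.

```python
# Start times copied from the StartTimes list above, for cameras 1..8.
START_TIMES = [5543, 3607, 27244, 31182, 1, 22402, 18968, 46766]


def local_to_master(camera: int, local_frame: int) -> int:
    """Convert a camera-local frame number to the master (Camera 5) time."""
    return local_frame + START_TIMES[camera - 1] - 1


def master_to_local(camera: int, master_frame: int) -> int:
    """Convert a master-time frame number back to a camera-local frame."""
    return master_frame - START_TIMES[camera - 1] + 1


# Frame 1 of Camera 1 corresponds to frame 5543 of Camera 5.
print(local_to_master(1, 1))
```

The subtraction of 1 reflects that local time starts at frame 1, so the first frame of each camera lands exactly on its start time.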


Detections at frame 73248

Person detections are generated using the Deformable Part Model [1]. We provide 8 .mat files, one per camera, with the following data format:

[camera, frame, left, top, right, bottom, …, left, top, right, bottom, S, confidence]

There are 9 [left, top, right, bottom] quadruples, one for the main box and 8 for the parts. ‘S’ is a DPM-specific value, and the last value is the detector confidence. Detections were thresholded at a confidence of -1.
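To make the row layout concrete, here is a hedged sketch that splits one detection row into its fields. It assumes the row has already been loaded as a flat sequence of 40 numbers (2 + 9 x 4 + 2); the function name and the dictionary keys are hypothetical.

```python
from typing import Dict, Sequence

NUM_BOXES = 9  # one main box followed by 8 part boxes


def parse_detection(row: Sequence[float]) -> Dict[str, object]:
    """Split one detection row into camera, frame, boxes, S and confidence."""
    expected = 2 + 4 * NUM_BOXES + 2
    if len(row) != expected:
        raise ValueError(f"expected {expected} values, got {len(row)}")
    coords = row[2:2 + 4 * NUM_BOXES]
    # Each box is a (left, top, right, bottom) quadruple.
    boxes = [tuple(coords[i:i + 4]) for i in range(0, len(coords), 4)]
    return {
        "camera": int(row[0]),
        "frame": int(row[1]),
        "main_box": boxes[0],
        "part_boxes": boxes[1:],
        "S": row[-2],
        "confidence": row[-1],
    }
```

Note that detection boxes are given as corners (right, bottom), whereas the ground truth and tracker output use width and height.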

Manual Annotations

Annotation keypoints

The data format for manual keypoint annotations (yellow dots) is:

[feetX, feetY, frame]

Annotators were instructed to click between a person’s feet so that the red vertical guide would be centered on the person’s body. The clicks were recorded on the image plane with click coordinates in [0,1]x[0,1], where the image top left corner is (0,0) and the bottom right corner is (1,1). The person walking in the above image was annotated with keypoint [0.6073, 0.4778, 140857].
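Since clicks are stored in normalized coordinates, they must be scaled by the image size before being drawn on a frame. A minimal sketch follows; the image width and height are parameters supplied by the caller, and the 1920x1080 resolution in the usage line is an assumption for illustration, not something stated here.

```python
from typing import Tuple


def keypoint_to_pixels(feet_x: float, feet_y: float,
                       image_width: int, image_height: int) -> Tuple[float, float]:
    """Scale a normalized [0,1]x[0,1] click to pixel coordinates."""
    if not (0.0 <= feet_x <= 1.0 and 0.0 <= feet_y <= 1.0):
        raise ValueError("keypoint coordinates must lie in [0, 1]")
    # (0, 0) is the top left corner and (1, 1) the bottom right corner.
    return feet_x * image_width, feet_y * image_height


# The annotated keypoint from the example above, at an assumed resolution.
print(keypoint_to_pixels(0.6073, 0.4778, 1920, 1080))
```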

Manual annotation data is sparse. We use this data to generate ground truth bounding boxes for every video frame.

Ground Truth

Ground truth boxes

The data format for ground truth is:

[camera, ID, frame, left, top, width, height, worldX, worldY, feetX, feetY]

We use image feetX/Y coordinates (yellow points) to generate bounding boxes [left, top, width, height] for every frame. This process is semi-automatic: we annotate a height scaling parameter for each trajectory in each camera so that the person’s silhouette is contained in the rectangle.

World coordinates (worldX, worldY) are generated by projecting annotation points [feetX, feetY] (yellow) from the image plane to the ground plane. Note that the yellow points (feetX,feetY) are used for projection instead of the green points (bounding box bottom center).
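The projection step can be illustrated with a plain homography transform. This is a sketch under stated assumptions: H is taken to be a 3x3 matrix that maps image points to ground plane points. Whether calibration.txt stores that direction or its inverse, and whether it expects pixel or normalized coordinates, is not specified here and should be checked against the file.

```python
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


def project_to_ground(H: Matrix3, feet_x: float, feet_y: float) -> Tuple[float, float]:
    """Apply an (assumed) image-to-ground homography to one feet point."""
    # Multiply H by the homogeneous point (feet_x, feet_y, 1).
    u = H[0][0] * feet_x + H[0][1] * feet_y + H[0][2]
    v = H[1][0] * feet_x + H[1][1] * feet_y + H[1][2]
    w = H[2][0] * feet_x + H[2][1] * feet_y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to get (worldX, worldY).
    return u / w, v / w
```

Projecting the feet point rather than the bounding box bottom center matters because the box is a scaled rectangle around the silhouette, so its bottom edge need not touch the ground where the person stands.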

The provided world coordinates can be used for ground plane evaluation or across-camera trajectory prediction. Feet annotations can be used for better bounding box generation.

Train/Test Data

We split the dataset into one training/validation set and two test sets, test-easy and test-hard. The test sets are withheld for evaluation on the motchallenge.net server while the training/validation set is public. The partition is given below:

Trainval: 0-50 min (dataset frames 49700-227540)
Test-easy: 60-85 min (dataset frames 263504-356648)
Test-hard: 50-60 min (dataset frames 227541-263503)

Test-easy is 25 minutes long and has statistics similar to trainval.

Test-hard is only 10 minutes long and contains a group of ~60 people traversing 4 cameras.

This sequence is very challenging due to the high density of people and long occlusions.
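The partition can be expressed as a lookup on the dataset frame number. The sketch below assumes the test-hard interval ends at frame 263503, the frame immediately before test-easy begins; the function name is illustrative.

```python
from typing import Optional

# Inclusive dataset frame ranges for each split.
PARTITION = {
    "trainval": (49700, 227540),
    "test-hard": (227541, 263503),
    "test-easy": (263504, 356648),
}


def split_of(dataset_frame: int) -> Optional[str]:
    """Return the split containing a dataset frame, or None if outside all."""
    for name, (first, last) in PARTITION.items():
        if first <= dataset_frame <= last:
            return name
    return None
```

Tracker output is reported in each camera's local time, so local frames need to be shifted by the camera's start time before this lookup applies.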


The evaluation is executed on the image plane and for all frames of the test set. We report multi-camera performance by IDP, IDR and IDF1 scores [2].
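For orientation, the three identity scores are simple ratios once the identity true positives (IDTP), false positives (IDFP) and false negatives (IDFN) are known. The sketch below only computes the ratios as defined in [2]; obtaining the counts requires the identity matching described there, which is not shown, and the function name is hypothetical.

```python
from typing import Tuple


def identity_scores(idtp: int, idfp: int, idfn: int) -> Tuple[float, float, float]:
    """Return (IDP, IDR, IDF1) from identity TP, FP and FN counts."""
    idp = idtp / (idtp + idfp) if idtp + idfp > 0 else 0.0
    idr = idtp / (idtp + idfn) if idtp + idfn > 0 else 0.0
    denominator = 2 * idtp + idfp + idfn
    idf1 = 2 * idtp / denominator if denominator > 0 else 0.0
    return idp, idr, idf1
```

IDF1 is the harmonic mean of IDP and IDR, so a tracker cannot score well by favoring precision or recall alone.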

We report single camera performance by additionally providing existing scores from the MOTChallenge devkit. Single camera evaluation assesses a tracker on all cameras separately and then reports the aggregated scores.

In both single- and multi-camera scenarios trackers are ranked by their IDF1 performance. Evaluation uses a 50% bounding box intersection-over-union threshold.
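The overlap test can be sketched as follows for boxes in the [left, top, width, height] format. Whether the official kit treats an overlap of exactly 50% as a match is not stated here, so the inclusive comparison below is an assumption.

```python
from typing import Sequence


def iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection-over-union of two [left, top, width, height] boxes."""
    al, at, aw, ah = box_a
    bl, bt, bw, bh = box_b
    # Width and height of the overlapping rectangle, zero if disjoint.
    inter_w = max(0.0, min(al + aw, bl + bw) - max(al, bl))
    inter_h = max(0.0, min(at + ah, bt + bh) - max(at, bt))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def is_match(box_a: Sequence[float], box_b: Sequence[float],
             threshold: float = 0.5) -> bool:
    """True if the two boxes overlap by at least the threshold (assumed inclusive)."""
    return iou(box_a, box_b) >= threshold
```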

The evaluation script uses ground truth trajectories clipped to each camera’s region of interest. The script also clips the tracker output by discarding any bounding boxes whose bottom center point is outside the ROI. Bounding boxes with bottom center points inside the ROI will be preserved.
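The clipping rule can be sketched with a point-in-polygon test on each box's bottom center. This assumes each ROI is available as a list of (x, y) polygon vertices in image coordinates; the layout of ROIs.txt is not described here, and the function names are hypothetical rather than taken from the evaluation kit.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(x: float, y: float, polygon: Sequence[Point]) -> bool:
    """Ray casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def clip_to_roi(boxes: Sequence[Sequence[float]],
                roi: Sequence[Point]) -> List[Sequence[float]]:
    """Keep [left, top, width, height] boxes whose bottom center is inside the ROI."""
    kept = []
    for left, top, width, height in boxes:
        bottom_center = (left + width / 2.0, top + height)
        if point_in_polygon(bottom_center[0], bottom_center[1], roi):
            kept.append((left, top, width, height))
    return kept
```

A box is therefore kept or dropped as a whole; it is not cropped to the ROI boundary.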

Each submission consists of one zip file containing one tracker output file named 'duke.txt'. The original submission file for the baseline method [2] is given for reference. The text file has format [cam, ID, frame, left, top, width, height, worldX, worldY]. World coordinates are optional and frames should remain in each camera's local time reference. The evaluation script will check if the submission contains data in both easy and hard intervals, and then evaluate them independently. Any computed detections outside the test intervals will be ignored.
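Packaging a submission can be sketched with the standard zipfile module. The field delimiter is an assumption (comma by default below) and should be checked against the baseline reference file; the function name is illustrative.

```python
import io
import zipfile
from typing import BinaryIO, Iterable, Sequence, Union


def write_submission(rows: Iterable[Sequence[object]],
                     target: Union[str, BinaryIO],
                     delimiter: str = ",") -> None:
    """Write tracker rows to 'duke.txt' inside a zip archive.

    Each row is [cam, ID, frame, left, top, width, height] with optional
    worldX, worldY; frames stay in each camera's local time.
    """
    lines = [delimiter.join(str(value) for value in row) for row in rows]
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("duke.txt", "\n".join(lines) + "\n")


# Example with two made-up rows, written to an in-memory buffer.
buffer = io.BytesIO()
write_submission([[1, 7, 1200, 100, 200, 50, 120],
                  [1, 7, 1201, 102, 200, 50, 120]], buffer)
```

Passing a file path instead of a buffer writes the archive to disk.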

For benchmark results, refer to motchallenge.net. Here you will find the official development kit used for evaluation on the motchallenge server.

Support or Contact

Having trouble with the data or code? Please contact Ergys Ristani or Francesco Solera. We will help you sort it out.


[1] A Discriminatively Trained, Multiscale, Deformable Part Model. P. Felzenszwalb, D. McAllester and D. Ramanan. CVPR 2008.

[2] Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. E. Ristani, F. Solera, R. S. Zou, R. Cucchiara and C. Tomasi. ECCV 2016 Workshop on Benchmarking Multi-Target Tracking.