DukeMTMC Project

Ergys Ristani

People involved:
Ergys Ristani, Francesco Solera, Roger S. Zou, Rita Cucchiara, Carlo Tomasi.


ID measures for Multi-Target Tracking


Multi-Target trackers nowadays are complex. In the picture below the interior of the blue circle shows components that popular trackers build upon.

Overview of measures

Even though as researchers we frequently benchmark and advance individual tracker components, the ultimate goal remains improving overall system performance so as to track reliably within and across cameras.

Multiple measures for both single- and multi-camera scenarios have been proposed over the years to quantify tracker performance [1-10]. The above picture illustrates some of the measure acronyms and is by no means exhaustive. Each measure reveals different characteristics of the same tracker.

Evaluation Paradigms

Multi-Target Tracking is an umbrella term for many tracking scenarios. The targets can be people, animals or cars; the video can stream from one or multiple cameras; and targets may additionally merge or split (e.g. cell tracking). Because different end-users have different needs, application-specific measures have been designed to serve them. It is in principle impossible to establish a single measure of performance that satisfies end-users in all scenarios.

Consider the example below where a suspect goes through an airport. Three trackers are deployed, and we are tasked to recommend one tracker to airport security.

Evaluation Paradigms

At the entrance the suspect is tagged as ID1 by each tracker, and each tracker at some point incorrectly assigns to the suspect the tag ID2. Depending on the frequency or length of these confusions, two evaluation paradigms stand out.

Paradigm 1

The first evaluation paradigm examines how often a target is lost or reacquired. It measures errors through identity switches, the sum of fragmentation and merge errors. This paradigm has been useful for researchers to help understand where and why trackers make mistakes. According to this paradigm, tracker a is the best choice (1 switch) and b and c are equally bad (7 switches).

Paradigm 2

The second paradigm instead evaluates how often a target is correctly identified, regardless of how often it is lost or reacquired.

According to this paradigm, tracker c is the best with 83% identification recall/precision, and a and b are equally bad with 67% identification recall/precision. This evaluation paradigm is more useful to end-users, in this case airport security, who would prefer tracker c because it correctly infers who is where more often.
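The two paradigms can be contrasted with a minimal Python sketch. The per-frame label sequences below are made up for illustration (they are not read off the figure); they are chosen only so that they reproduce the switch counts and percentages quoted above. The helper names `identity_switches` and `identification_rate` are hypothetical.

```python
def identity_switches(labels):
    """Paradigm 1: count how often the assigned tag changes between consecutive frames."""
    return sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)


def identification_rate(labels):
    """Paradigm 2: fraction of frames covered by the single best-matching tag."""
    best = max(set(labels), key=labels.count)
    return labels.count(best) / len(labels)


# Hypothetical tags assigned to the suspect over 24 frames (assumption, not figure data).
trackers = {
    "a": [1] * 16 + [2] * 8,          # one long confusion at the end
    "b": ([1] * 4 + [2] * 2) * 4,     # frequent, longer confusions
    "c": ([1] * 5 + [2]) * 4,         # frequent, but very short confusions
}

for name, labels in trackers.items():
    print(name, identity_switches(labels), round(identification_rate(labels), 2))
```

In this simplified setting there is one true identity and exactly one computed tag per frame, so identification precision and recall coincide, which is why a single rate is reported.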

Key Evaluation Steps

Multiple Object Tracking evaluation is carried out in two steps:

  1. Mapping true and computed identities
  2. Computing a score on top of the mapping

In what follows you will find how these steps are designed for ID measures.

In step 1, ID measures compute a mapping between true and computed identities that is strictly 1-1. The picture below gives an illustration.

ID Mapping

A bipartite graph is constructed where the left partition contains true identities and the right partition contains computed identities. Edge costs represent the number of misassigned frames (false negatives plus false positives) if two identities were to correspond. The two partitions are given equal cardinality by adding a false positive node to the left partition for each computed identity and a false negative node to the right partition for each true identity. The optimal match is found using minimum-cost bipartite matching and is displayed below.
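The construction described above can be sketched in a few lines of Python. This is an illustrative simplification, not the reference implementation: each identity is assumed to be a set of `(frame, cell)` detections, a true and a computed detection are assumed to match when they are equal, and the optimum is found by exhaustive search over permutations rather than a dedicated assignment solver, so it is only suitable for tiny examples. The function name `id_mapping` is hypothetical.

```python
from itertools import permutations


def id_mapping(true_ids, computed_ids):
    """Return (mapping, total_cost) of the minimum-cost 1-1 identity match.

    true_ids / computed_ids: dict mapping identity name -> set of (frame, cell).
    Rows are true identities followed by false positive nodes; columns are
    computed identities followed by false negative nodes.
    """
    t_names, c_names = list(true_ids), list(computed_ids)
    n = len(t_names) + len(c_names)  # both partitions padded to equal size

    def cost(row, col):
        t = true_ids[t_names[row]] if row < len(t_names) else None
        c = computed_ids[c_names[col]] if col < len(c_names) else None
        if t is None and c is None:
            return 0                  # padding node matched to padding node
        if t is None:
            return len(c)             # computed identity is entirely false positive
        if c is None:
            return len(t)             # true identity is entirely missed
        overlap = len(t & c)
        return (len(t) - overlap) + (len(c) - overlap)  # misassigned frames

    best = min(permutations(range(n)),
               key=lambda p: sum(cost(r, p[r]) for r in range(n)))
    total = sum(cost(r, best[r]) for r in range(n))
    mapping = {t_names[r]: c_names[best[r]]
               for r in range(len(t_names)) if best[r] < len(c_names)}
    return mapping, total


# Made-up toy data: two true identities, three computed identities.
true_ids = {
    "A": {(f, "A") for f in range(10)},
    "B": {(f, "B") for f in range(5)},
}
computed_ids = {
    "b": {(f, "A") for f in range(8)},
    "c": {(8, "A"), (9, "A")},
    "d": {(f, "B") for f in range(3)} | {(f, "X") for f in range(3, 5)},
}
mapping, total = id_mapping(true_ids, computed_ids)
print(mapping, total)
```

Because the match is strictly 1-1, computed identity c is left unmatched here even though it overlaps A, and its frames count as false positives.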


In step 2, given the bijective ID mapping, scores are computed on top of it. The scores are precision, recall and F1-score.
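As a rough sketch of step 2, the three scores can be computed from the counts induced by the mapping, following the definitions in [10]: IDTP (frames where matched identities agree), IDFP (computed detections left unexplained) and IDFN (true detections left unexplained). The function name `id_scores` and the counts in the usage line are hypothetical, and the sketch assumes the denominators are non-zero.

```python
def id_scores(idtp, idfp, idfn):
    """Identification precision, recall and F1 from ID true/false positive/negative counts."""
    idp = idtp / (idtp + idfp)                    # fraction of computed detections correctly identified
    idr = idtp / (idtp + idfn)                    # fraction of true detections correctly identified
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn)    # harmonic mean of IDP and IDR
    return idp, idr, idf1


# Hypothetical counts, for illustration only.
idp, idr, idf1 = id_scores(idtp=8, idfp=2, idfn=4)
print(round(idp, 3), round(idr, 3), round(idf1, 3))
```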

ID Scores

They have the following interpretations:

Comparison with CLEAR-MOT

CLEAR-MOT scores

Consider the two trackers in the picture above. According to scoring functions built on top of the popular CLEAR-MOT mapping, the two trackers appear to perform equally well.

Both trackers have an MT score of 1, as the ground truth is covered 100% by detections. Both trackers receive the same FRG penalty for switching identity 4 times. MOTA is 0.6 for both trackers due to the same FRG penalty.

CLEAR-MOT scores

In the case of Tracker, each of the computed identities b, c, e, g, and f explains at best 20% of the true identity A. Tracker++ however identifies the true identity more often as b explains 80% of A.
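The arithmetic behind this comparison can be reproduced with a short sketch. It assumes the true identity A spans 10 frames (consistent with 4 switches giving a MOTA of 0.6 when there are no false positives or false negatives) and uses made-up label orderings that match the percentages above; the exact sequences in the figure may differ. MOTA is computed with the standard CLEAR-MOT formula from [1], 1 - (FN + FP + switches) / GT, and the helper names are hypothetical.

```python
def count_switches(labels):
    """Number of identity changes between consecutive frames."""
    return sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)


def mota(labels, false_negatives=0, false_positives=0):
    """CLEAR-MOT accuracy for one fully covered true identity."""
    errors = false_negatives + false_positives + count_switches(labels)
    return 1 - errors / len(labels)


def best_coverage(labels):
    """Fraction of the true identity explained by its best computed identity."""
    return max(labels.count(x) for x in set(labels)) / len(labels)


# Assumed per-frame computed identities for true identity A (10 frames).
tracker = list("bbcceeggff")      # five identities, each covering 20%
tracker_pp = list("bbbcbbebbb")   # b covers 80%, same number of switches

for name, labels in [("Tracker", tracker), ("Tracker++", tracker_pp)]:
    print(name, count_switches(labels), round(mota(labels), 2), best_coverage(labels))
```

Both sequences yield the same switch count and MOTA, while the best-identity coverage differs by a factor of four.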

This toy example illustrates how a tracker's identification ability is not captured by existing CLEAR-MOT measures. Overall, the following statements can be made about differences between CLEAR-MOT measures and ID measures:

Practical Implications

When evaluating the same trackers on the MOT16 dataset from MOTChallenge and under different evaluation paradigms, differences in ranking are inevitable. This is not to say that one measure is better than the other. It only suggests that different performance measures capture different aspects of the same tracker.


Looking at the identification precision/recall curves in the figure below, it can be seen that trackers on MOT16 have lower ID recall. This can be attributed to the trackers' inability to recall identities after occlusions.

ID Precision/Recall

It can also be observed that overall system performance is far from perfect, which makes Multi-Target Tracking an exciting problem with many open challenges.


Ergys Ristani. October 4, 2017.


[1] Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. K. Bernardin, R. Stiefelhagen. Image Video Proc. 2008

[2] Learning to Associate: HybridBoosted Multi-Target Tracker for Crowded Scene. Y. Li, C. Huang and R. Nevatia. CVPR 2009

[3] An Equalized Global Graph Model-Based Approach for Multi-Camera Object Tracking. W. Chen, L. Cao, X. Chen and K. Huang. IEEE TCAS 2016

[4] Inter-camera Association of Multi-target Tracks by On-Line Learned Appearance Affinity Models. C. H. Kuo, C. Huang and R. Nevatia. ECCV 2010

[5] Framework for Performance Evaluation of Face, Text, and Vehicle Detection and Tracking in Video: Data, Metrics, and Protocol. R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova and J. Zhang. TPAMI 2008

[6] Towards the Evaluation of Reproducible Robustness in Tracking-by-Detection. F. Solera, S. Calderara and R. Cucchiara. AVSS 2015

[7] A New Benchmark and Protocol for Multi-Object Detection and Tracking. L. Wen, D. Du, Z. Cai, Z. Lei, M.C. Chang, H. Qi, J. Lim, M.H. Yang and S. Lyu. arXiv CoRR 2015

[8] Track-Clustering Error Evaluation for Track-Based Multi-camera Tracking System Employing Human Re-identification. C. W. Wu, M. T. Zhong, Y. Tsao, S. W. Yang, Y. K. Chen, S. Y. Chien. CVPRW 2017

[9] Evaluating Multi-Object Tracking. K. Smith, D. Gatica-Perez, J. M. Odobez. CVPRWS 2005

[10] Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. E. Ristani, F. Solera, R. S. Zou, R. Cucchiara and C. Tomasi. ECCV 2016 Workshop on Benchmarking Multi-Target Tracking.