
End-to-End Joint Multi-Object Detection and Tracking for Intelligent Transportation Systems

Abstract

Environment perception is one of the most critical technologies of intelligent transportation systems (ITS). The motion interaction between multiple vehicles in ITS makes it important to perform multi-object tracking (MOT). However, most existing MOT algorithms follow the tracking-by-detection framework, which separates detection and tracking into two independent stages and thus limits overall efficiency. Recently, a few algorithms have combined feature extraction into one network; however, the tracking portion still relies on data association and requires complex post-processing for life cycle management. These methods do not combine detection and tracking efficiently. This paper presents a novel network, named the global correlation network (GCNet), that realizes joint multi-object detection and tracking in an end-to-end manner for ITS. Unlike most object detection methods, GCNet introduces a global correlation layer to regress the absolute size and coordinates of bounding boxes, instead of predicting offsets. The detection and tracking pipeline of GCNet is conceptually simple and does not require complicated tracking strategies such as non-maximum suppression and data association. GCNet was evaluated on a multi-vehicle tracking dataset, UA-DETRAC, demonstrating promising performance compared with state-of-the-art detectors and trackers.

1 Introduction

Environment perception is one of the most critical technologies of intelligent transportation systems (ITS), because its performance has an important impact on the subsequent processes of decision making and vehicle control [1,2,3,4]. The complex motion interaction between multiple vehicles in ITS makes it important to perform multi-object tracking (MOT) from both the vehicle and roadside perspectives [5, 6]. MOT is a basic problem in environment perception, whose goal is to compute the trajectories of all objects of interest from consecutive frames of images. It has a wide range of application scenarios, such as autonomous driving, motion attitude analysis, and traffic monitoring. Recently, MOT has been receiving increasing attention.

Traditional MOT algorithms follow the tracking-by-detection framework, which is split into two modules: detection and tracking. With the development of object detection, these algorithms achieve excellent performance and largely dominate the MOT domain. The tracking module in a tracking-by-detection framework generally contains three parts: feature extraction, data association, and lifecycle management. Early tracking methods use simple features, such as location, shape, and velocity, to accomplish data association; however, these features have evident deficiencies. Later methods utilize appearance features, especially high-level features from deep neural networks. These appearance features can significantly improve association accuracy and robustness, but at the cost of increased computation. Currently, a few MOT algorithms integrate feature extraction into the detection module by adding a ReID head to obtain instance-level features for data association. Although these algorithms require less computation, they still rely on data association with motion prediction and complex tracking strategies, resulting in surplus hyperparameters and a cumbersome inference pipeline.

This paper presents a novel network for end-to-end joint detection and tracking. The network realizes bounding box regression and tracking in the same manner, referred to as global correlation. Notably, bounding box regression generally uses local features to estimate the offsets between the anchor and the ground truth, or to estimate the box size and the offset between the key point and the feature location. In contrast, the proposed framework regresses the absolute coordinates and size of the bounding box, rather than relative coordinates or offsets. However, in traditional convolutional neural networks, a local feature cannot contain global information when the receptive field is too small. The self-attention mechanism allows the features at each location to contain global information; however, its computational complexity is too large for high-resolution feature maps. Hence, this paper introduces the global correlation layer to encode global information into the features at each location. The correlation vectors generated by the global correlation layer encode the correlation between the local feature vector Q and the global feature map K. Q and K from the image of the same frame are used when performing object detection; conversely, Q from the image of the previous frame and K from the image of the current frame are used when performing object tracking. In this manner, this paper unifies detection and tracking under the same framework.

This paper performs algorithm evaluation on a vehicle tracking dataset, UA-DETRAC, which is captured from a roadside view and can be seen as a typical application of environment perception in ITS. GCNet demonstrates competitive performance, with 74.04% average precision (AP) at 36 frame/s in detection, and 19.10% PR-MOTA at 34 frame/s in tracking. Figure 1 shows some examples of tracking results. To summarize, the main contributions of this paper are as follows:

  (1) This paper proposes a novel network, GCNet, to realize end-to-end joint multi-object detection and tracking, serving onboard and roadside perception in ITS.

  (2) This paper develops the global correlation layer of GCNet, which encodes the correlation between local feature vectors and the global feature map at low computational cost.

  (3) This paper demonstrates the competitive performance of GCNet through comparative experiments on the UA-DETRAC dataset. The results show the advantages of the proposed framework in both the detection and tracking processes.

Figure 1  Examples of tracking results on UA-DETRAC dataset

The remainder of this paper is organized as follows. Section 2 introduces existing research related to this paper. Section 3 provides the methodology, including network components and implementation details. Section 4 presents the experiments, and Section 5 gives the conclusions.

2 Related Works

2.1 Object Detection

With the advancements in deep learning, object detection technology has developed rapidly. Existing object detection algorithms can be divided into two categories: anchor-based [7,8,9] and anchor-free [10,11,12]. Anchor-based algorithms set a series of anchor boxes and regress the offsets between the anchor boxes and the ground truth using local features. Methods based on region convolutional neural networks (R-CNN) utilize heuristic algorithms [13] or region proposal networks (RPN) [7, 14, 15] to generate region proposals as anchors. Most anchor-free algorithms use fully convolutional networks to estimate the key points of targets, and further obtain the bounding boxes through the key points. These algorithms rely on local features for bounding box regression, such that they only obtain the offsets between the anchor boxes or key points and the ground truth, rather than absolute bounding box coordinates. Detection transformer (DETR) [12] adopted an encoder-decoder architecture based on transformers to achieve object detection. A transformer can integrate global information into the features at each position; however, the self-attention mechanism of the transformer requires a considerable amount of computation and GPU memory, which makes it difficult to apply to high-resolution feature maps. In the proposed joint detection and tracking framework, the network detects objects in a single image and tracks objects across different images. However, the offsets for the same object in different images are hard to define. Hence, this paper introduces a global correlation layer, rather than a transformer, to embed global information into the features at each position for absolute coordinate regression, which can be applied to higher-resolution feature maps.

2.2 Tracking-by-Detection

With the improvement in detection accuracy, tracking-by-detection methods [16,17,18] have become mainstream in the field of MOT. Tracking is treated as a data association problem in tracking-by-detection frameworks. Features such as motion [19], shape [20], and appearance [21, 22] are used to describe the correlation between detections and tracks, and thus a correlation matrix is established. Algorithms including the Hungarian algorithm [23], JPDA [16], and MHT [24] take the correlation matrix as input to complete data association. Although these algorithms have made significant progress, there are certain drawbacks. First, they do not combine the detector and tracker efficiently, and most of them perform feature extraction separately, which involves unnecessary computation. Second, they often rely on complicated tracking rules for lifecycle management, resulting in numerous hyperparameters and difficult tuning. In the proposed approach, detection and tracking are performed in the same manner, such that they are well combined and the computation of feature extraction is reduced. Additionally, the proposed approach eliminates complex tracking rules.

2.3 Joint Detection and Tracking

In the field of MOT, combining detection and tracking is an important research direction. With the rapid maturity of multi-task learning in deep learning, many methods use a single network to complete the detection and tracking tasks by adding ReID feature extraction to existing object detection networks [25,26,27]. Wang et al. [28] proposed the joint detection and embedding (JDE) method, which allows target detection and appearance embedding to be learned in a shared model. Bergmann et al. [29] proposed a JDT method that adopts the Faster R-CNN framework and accomplishes tracking by region of interest (RoI) pooling and bounding box regression without data association. Zhou et al. [10] took the current and previous frames, as well as a heatmap rendered from tracked object centers, as inputs and produced an offset map, which simplifies data association considerably. Peng et al. [30] converted the MOT problem into a pair-wise object detection problem and proposed the chained-tracker method, realizing end-to-end joint object detection and tracking. Similarly, this study also provides a new idea for joint detection and tracking. Compared with Trackformer [31], which formulates the MOT task as a frame-to-frame set prediction problem and proposes a tracking-by-attention network based on DETR [12], the network structure of GCNet is simpler and can reach a higher inference speed.

3 Methodology of Global Correlation Network

The proposed network is designed to solve the online MOT problem. At time step \(t\), the network has obtained the object trajectories \(\left\{{{\varvec{T}}}_{1},{{\varvec{T}}}_{2},\dots ,{{\varvec{T}}}_{n}\right\}\) from time 0 to time \(t-1\), where \({{\varvec{T}}}_{i}=\left[{{\varvec{B}}}_{i,1},{{\varvec{B}}}_{i,2},\dots ,{{\varvec{B}}}_{i,t-1}\right]\) and \({{\varvec{B}}}_{i,j}\) is the bounding box of object \(i\) at time \(j\). Given the image of the current frame \({{\varvec{I}}}_{t}\in {{\varvec{R}}}^{h\times w\times 3}\), the network assigns the bounding boxes \({{\varvec{B}}}_{x,t}\) of objects in the current frame to historical trajectories, or generates new trajectories. The following section introduces the proposed algorithm in detail.

3.1 Global Correlation Network

In this part, the global correlation layer and its application in the end-to-end joint detection and tracking framework are introduced. Furthermore, the specific implementations of the detection and tracking modules in the proposed GCNet are described.

Global correlation layer: The global correlation layer in GCNet encodes global information to generate the correlation vectors, which are utilized in both the detection and tracking modules. Given a feature map \({\varvec{F}}\in {{\varvec{R}}}^{h\times w\times c}\), two feature maps \({\varvec{Q}}\) and \({\varvec{K}}\) are obtained from the following two linear transformations:

$${{\varvec{Q}}}_{ij}={{\varvec{W}}}_{q}{{\varvec{F}}}_{ij}, {{\varvec{K}}}_{ij}={{\varvec{W}}}_{k}{{\varvec{F}}}_{ij},$$
(1)

where \({{\varvec{F}}}_{ij}\in {{\varvec{R}}}^{c}\) denotes the feature vector at the \(i\)th row and \(j\)th column of \({\varvec{F}}\). Further, for each feature vector \({{\varvec{Q}}}_{ij}\), the cosine similarity between it and the feature vectors at all positions of \({\varvec{K}}\) is calculated. Following another linear transformation \(\dot{W}\), the correlation vectors \({{\varvec{C}}}_{ij}\in {R}^{{c}{\prime}}\) are obtained:

$${{\varvec{C}}}_{ij}=\dot{W}\cdot \mathrm{flatten}\left(\left[\begin{array}{ccc}\frac{{{\varvec{Q}}}_{ij}{{\varvec{K}}}_{11}}{\left|{{\varvec{Q}}}_{ij}\right|\left|{{\varvec{K}}}_{11}\right|}& \dots & \frac{{{\varvec{Q}}}_{ij}{{\varvec{K}}}_{1w}}{\left|{{\varvec{Q}}}_{ij}\right|\left|{{\varvec{K}}}_{1w}\right|}\\ \vdots & \ddots & \vdots \\ \frac{{{\varvec{Q}}}_{ij}{{\varvec{K}}}_{h1}}{\left|{{\varvec{Q}}}_{ij}\right|\left|{{\varvec{K}}}_{h1}\right|}& \dots & \frac{{{\varvec{Q}}}_{ij}{{\varvec{K}}}_{hw}}{\left|{{\varvec{Q}}}_{ij}\right|\left|{{\varvec{K}}}_{hw}\right|}\end{array}\right]\right).$$
(2)

These correlation vectors \({{\varvec{C}}}_{ij}\) encode the correlation between the local feature vectors \({{\varvec{Q}}}_{ij}\) and the global feature map \({\varvec{K}}\), such that they can be used to regress the absolute bounding boxes of the objects at the corresponding positions in the image. All of the correlation vectors \({{\varvec{C}}}_{ij}\) form a correlation map \({\varvec{C}}\in {R}^{h\times w\times {c}{\prime}}\), from which the bounding boxes \({\varvec{B}}\in {R}^{h\times w\times 4}\) are obtained using a convolution layer with \(1\times 1\) kernel size. \({\varvec{K}}\) and \({\varvec{Q}}\) from the image of the same frame are used when performing object detection; conversely, \({\varvec{Q}}\) from the image of the previous frame and \({\varvec{K}}\) from the image of the current frame are used when performing object tracking. In this manner, detection and tracking are unified under the same framework.
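To make this computation concrete, the following TensorFlow sketch implements Eqs. (1) and (2); the class interface, channel widths, and the fixed feature-map size passed to the constructor are illustrative assumptions rather than details specified in the paper.

```python
import tensorflow as tf


class GlobalCorrelation(tf.keras.layers.Layer):
    """Minimal sketch of the global correlation layer (Eqs. (1)-(2))."""

    def __init__(self, h, w, c=256, c_prime=64):
        super().__init__()
        self.h, self.w = h, w
        self.w_q = tf.keras.layers.Conv2D(c, 1)        # W_q as a 1x1 convolution
        self.w_k = tf.keras.layers.Conv2D(c, 1)        # W_k as a 1x1 convolution
        self.w_out = tf.keras.layers.Dense(c_prime)    # W' applied to the flattened map

    def call(self, feat_q, feat_k):
        # feat_q, feat_k: [B, h, w, c_in] backbone features; the same frame for
        # detection, the previous/current frames for tracking.
        q = tf.math.l2_normalize(self.w_q(feat_q), axis=-1)
        k = tf.math.l2_normalize(self.w_k(feat_k), axis=-1)
        hw = self.h * self.w
        q_flat = tf.reshape(q, [-1, hw, q.shape[-1]])
        k_flat = tf.reshape(k, [-1, hw, k.shape[-1]])
        # Cosine similarity of every query position with every key position (Eq. (2)).
        corr = tf.matmul(q_flat, k_flat, transpose_b=True)   # [B, hw, hw]
        c_vec = self.w_out(corr)                             # correlation vectors C_ij
        return tf.reshape(c_vec, [-1, self.h, self.w, c_vec.shape[-1]])
```

For detection, both arguments receive the same frame's features; for tracking, the previous frame supplies feat_q and the current frame supplies feat_k, mirroring the description above.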

Compared with the traditional self-attention layer, the global correlation layer has an advantage in computation. The computation of a traditional self-attention layer includes three parts: computing the attention weights, \(c\times \left(h\times w\right)\times {\left(h\times w\right)}^{\mathrm{T}}=c{h}^{2}{w}^{2}\); the softmax, \(chw\); and the weighted summation, \(c\times \left(h\times w\right)\times {\left(h\times w\right)}^{\mathrm{T}}=c{h}^{2}{w}^{2}\). As shown in Eq. (2), the computation of the global correlation layer is \(c\times \left(h\times w\right)\times {\left(h\times w\right)}^{\mathrm{T}}=c{h}^{2}{w}^{2}\), which is significantly less than the total computation of the self-attention layer, \((2c+1){h}^{2}{w}^{2}\).
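For a rough sense of scale, take the finest-level feature map used later in Section 3.1 (\(64\times 112\) for a \(512\times 896\) input) and an assumed channel width of \(c=256\); the paper does not state the exact channel width, so the figures below are only illustrative:

$$c{h}^{2}{w}^{2}=256\times {64}^{2}\times {112}^{2}\approx 1.3\times {10}^{10},\quad \left(2c+1\right){h}^{2}{w}^{2}=513\times {64}^{2}\times {112}^{2}\approx 2.6\times {10}^{10}.$$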

For the object classification branch, this study uses the same network structure and training strategy as CenterNet. During inference, a detection heatmap \({{\varvec{Y}}}_{d}\) and a tracking heatmap \({{\varvec{Y}}}_{t}\) are obtained in each frame. The detection heatmap \({{\varvec{Y}}}_{d}\) denotes the detection confidence of the object centers in the current frame, while the tracking heatmap \({{\varvec{Y}}}_{t}\) denotes the tracking confidence between the current and next frame. The peaks in the heatmaps correspond to the detection and tracking key points, and max-pooling is used to obtain the final bounding boxes, without applying box non-maximum suppression (NMS).

$${{\varvec{B}}}_{ij}\in Result,\ \forall i,j:\ \mathrm{maxpool}{\left({\varvec{Y}},3,1\right)}_{ij}={{\varvec{Y}}}_{ij},$$
(3)

where \(\mathrm{maxpool}\left({\varvec{H}},a,b\right)\) represents a max-pooling layer with kernel size \(a\) and stride \(b\). Hence, GCNet realizes joint multi-object detection (MOD) and MOT with a concise pipeline, without complicated post-processing such as NMS and data association.
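A minimal TensorFlow sketch of this peak selection is given below; the additional confidence threshold is an assumption (mirroring the thresholds used in Section 4.1) rather than part of Eq. (3).

```python
import tensorflow as tf


def extract_peaks(heatmap, threshold=0.3, kernel=3):
    """Minimal sketch of the peak selection in Eq. (3).

    heatmap: [B, h', w', n] detection or tracking confidence map.
    A location is kept when it equals its local 3x3 maximum; no box NMS is applied.
    """
    pooled = tf.nn.max_pool2d(heatmap, ksize=kernel, strides=1, padding='SAME')
    is_peak = tf.equal(pooled, heatmap)          # maxpool(Y, 3, 1)_ij == Y_ij
    return tf.logical_and(is_peak, heatmap > threshold)
```

The boolean mask can then be converted to sparse peak coordinates with tf.where.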

Detection module: The detection module architecture is depicted in Figure 2, and contains three parts: a backbone, a classification branch, and a regression branch. The backbone performs high-level feature extraction. Because the classification branch is identical to that of CenterNet, each location of the feature map corresponds to an object center point, and the resolution of the feature map crucially affects the network performance. To obtain a high resolution while retaining a large receptive field, the same skip-connection structure as a feature pyramid network (FPN) is adopted; however, only the finest-level feature map \({\varvec{F}}\) is output. The size of the feature map \({\varvec{F}}\) is \({h}^{\prime}\times {w}^{\prime}\times c\), which is equivalent to \(\frac{h}{8}\times \frac{w}{8}\times c\); here, \(h\) and \(w\) are the height and width of the original image, respectively. This resolution is 4 times that of DETR. The classification branch is a fully convolutional network, and outputs a confidence map \({{\varvec{Y}}}_{d}\in {R}^{{h}^{\prime}\times {w}^{\prime}\times n}\) with values between 0 and 1. The peaks of the \(i\)th channel of \({{\varvec{Y}}}_{d}\) correspond to the centers of the objects belonging to the \(i\)th category. The regression branch is used to calculate the bounding boxes \(\left\{{\left[x,y,h,w\right]}_{i}|1\le i\le N\right\}\). First, \({\varvec{F}}\) and \({{\varvec{Y}}}_{d}\) are taken as inputs to generate three feature maps \({\varvec{K}}\), \({\varvec{Q}}\), and \({\varvec{V}}\):

$$\begin{array}{l} {\varvec{Q}} = B{N}_{Q}\left(Con{v}_{Q}\left({\varvec{F}},1,1,c\right)+{\varvec{P}}\right),\\ {\varvec{K}} = Gate\left[B{N}_{K}\left(Con{v}_{K}\left({\varvec{F}},1,1,c\right)+{\varvec{P}}\right),{{\varvec{Y}}}_{d}\right],\\ {\varvec{V}} = Con{v}_{V}\left({\varvec{F}},1,1,c\right),\end{array}$$
(4)

where \(Conv\left({\varvec{F}},a,b,c\right)\) denotes a convolution layer with kernel size \(a\), stride \(b\), and \(c\) kernels, and \(BN\) denotes a batch normalization layer. \(Gate\left({\varvec{X}},{\varvec{Y}}\right)\), depicted in Figure 3, is a form of spatial attention. \({\varvec{P}}\) is the position embedding with the same shape as \({\varvec{F}}\), and is expressed as:

$$\begin{array}{*{20}c} {P_{ijk} = \left\{ {\begin{array}{*{20}c} {\cos \left( {\frac{4\pi k}{c} + \frac{\pi i}{h}} \right), 0 \le k < \frac{c}{2},} \\ {\cos \left( {\frac{4\pi k}{c} + \frac{\pi j}{w}} \right),\frac{c}{2} \le k < c,} \\ \end{array} } \right.} \\ {0 \le i < h^{\prime}, 0 \le j < w^{\prime}.} \\ \end{array}$$
(5)
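The following NumPy sketch builds this embedding; treating \(h\) and \(w\) as the feature-map height and width (and \(c\) as its channel count) is an interpretation of Eq. (5), since the equation indexes \(0\le i<h^{\prime}\) while dividing by \(h\).

```python
import numpy as np


def position_embedding(h, w, c):
    """Minimal NumPy sketch of the cosine position embedding in Eq. (5).

    The first c/2 channels encode the row index i; the remaining channels
    encode the column index j.
    """
    p = np.zeros((h, w, c), dtype=np.float32)
    i = np.arange(h).reshape(h, 1, 1)             # row index
    j = np.arange(w).reshape(1, w, 1)             # column index
    k1 = np.arange(c // 2).reshape(1, 1, -1)      # channels 0 .. c/2 - 1
    k2 = np.arange(c // 2, c).reshape(1, 1, -1)   # channels c/2 .. c - 1
    p[:, :, :c // 2] = np.cos(4 * np.pi * k1 / c + np.pi * i / h)
    p[:, :, c // 2:] = np.cos(4 * np.pi * k2 / c + np.pi * j / w)
    return p
```

Embeddings of nearby rows or columns differ only slightly in phase, which produces the distance-dependent cosine similarity discussed next.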
Figure 2  Detection module architecture

Figure 3  Illustration of the gate step

Two embedding vectors that are close in position have a large cosine similarity, while two that are farther apart have a smaller one. This attribute reduces the negative influence of similar objects during tracking. Further, the correlation vectors \({{\varvec{C}}}_{ij}\) between \({{\varvec{Q}}}_{ij}\) and \({\varvec{K}}\) are calculated using Eq. (2). The final bounding boxes \({{\varvec{B}}}_{d,ij}=\left[{x}_{ij},{y}_{ij},{h}_{ij},{w}_{ij}\right]\) are obtained using Eq. (6). Here, the absolute coordinates and size of the bounding box are directly regressed, which differs from most existing methods, especially anchor-based methods.

$${{\varvec{B}}}_{d,ij}=W\cdot BN\left(\left[\begin{array}{cc}{{\varvec{C}}}_{ij}& {{\varvec{V}}}_{ij}\end{array}\right]\right).$$
(6)

Tracking module: Tracking is the process of assigning objects in the current frame to historical tracks, or generating new tracks. The architecture of the tracking module is depicted in Figure 4. The inputs of the tracking module are: (1) the feature map \({\varvec{K}}\) of the current frame, (2) the detection confidence map of the current frame, and (3) the feature vectors of historical tracks. The tracking module outputs a tracking confidence and a bounding box for each historical track. It can be observed that this architecture is almost identical to that of the detection module. Most of its network parameters are shared with the detection module, except for the fully connected layer used for calculating the tracking confidence (the green block in Figure 4). The tracked bounding boxes are expressed in the same form as the detected boxes, i.e., \({{\varvec{B}}}_{i}=\left[{x}_{i},{y}_{i},{h}_{i},{w}_{i}\right]\), with absolute coordinates and size. The tracking confidences indicate whether the objects are still present in the image of the current frame. The tracking module functions in an object-wise manner, such that it naturally passes the ID of each object to the next frame, similar to parallel single-object tracking.

Figure 4  Tracking module architecture

3.2 Training

Although the proposed model can be trained end-to-end, the GCNet is trained in two stages in this study. First, the detection module is trained, and then the entire network is fine-tuned. The training strategy of the classification branch is consistent with that of CornerNet. A heatmap \({{\varvec{Y}}}_{gt}\in {R}^{h\mathrm{^{\prime}}\times w\mathrm{^{\prime}}\times n}\) with a 2D Gaussian kernel is defined as follows:

$$\begin{array}{cc}& {Y}_{gt,ijk}=\underset{1\le n\le {N}_{k}}{\mathrm{max}}\left({G}_{ijn}\right),\\ & {G}_{ijn}=\mathrm{exp}\left[-\frac{{\left(i-{x}_{n}\right)}^{2}}{2{\sigma }_{x,n}^{2}}-\frac{{\left(j-{y}_{n}\right)}^{2}}{2{\sigma }_{y,n}^{2}}\right],\end{array}$$
(7)

where \({N}_{k}\) is the number of objects of class \(k\), \(\left[{x}_{n},{y}_{n}\right]\) is the center of object \(n\), and variance \({\sigma }^{2}\) is relative to the object size. \({\sigma }_{x}\) and \({\sigma }_{y}\) are expressed as shown in Eq. (8), and \({\eta }_{\text{IoU}}\) is set to \(0.3\).

$$\begin{array}{c}{\sigma }_{x}=\frac{h\left(1-{\eta }_{\text{IoU}}\right)}{3\left(1+{\eta }_{\text{IoU}}\right)},\\ {\sigma }_{y}=\frac{w\left(1-{\eta }_{\text{IoU}}\right)}{3\left(1+{\eta }_{\text{IoU}}\right)}.\end{array}$$
(8)
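As a concrete illustration of Eqs. (7) and (8), the NumPy sketch below builds the ground-truth heatmap for one class; the argument names and the representation of centers and sizes are assumptions.

```python
import numpy as np


def gaussian_heatmap(centers, sizes, h, w, eta_iou=0.3):
    """Minimal sketch of the ground-truth heatmap for one class (Eqs. (7)-(8)).

    centers: list of (x_n, y_n) object centers on the feature-map grid;
    sizes:   list of (h_n, w_n) object sizes on the same grid.
    """
    y_gt = np.zeros((h, w), dtype=np.float32)
    ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    ratio = (1.0 - eta_iou) / (3.0 * (1.0 + eta_iou))
    for (xc, yc), (oh, ow) in zip(centers, sizes):
        sx, sy = oh * ratio, ow * ratio               # Eq. (8)
        g = np.exp(-(ii - xc) ** 2 / (2 * sx ** 2)
                   - (jj - yc) ** 2 / (2 * sy ** 2))  # Eq. (7)
        y_gt = np.maximum(y_gt, g)                    # pixel-wise max over objects
    return y_gt
```

Per-class heatmaps produced this way are stacked along the channel axis to form \({{\varvec{Y}}}_{gt}\).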

The classification loss is a penalty-reduced pixel-wise focal loss:

$$\begin{array}{c}{L}_{d,cla}=-\frac{1}{{h}^{\mathrm{^{\prime}}}{w}^{\mathrm{^{\prime}}}n}\cdot \sum_{ijk}\left\{\begin{array}{ll}{\left(1-{Y}_{d,ijk}\right)}^{2}\mathrm{log}\left({Y}_{d,ijk}\right),& {Y}_{gt,ijk}=1,\\ {\left(1-{Y}_{gt,ijk}\right)}^{2}{Y}_{d,ijk}^{2}\mathrm{log}\left(1-{Y}_{d,ijk}\right),& {Y}_{gt,ijk}\ne 1.\end{array}\right.\end{array}$$
(9)
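A minimal TensorFlow sketch of this loss is given below; the small epsilon for numerical stability and the batch-averaging convention are implementation assumptions.

```python
import tensorflow as tf


def classification_loss(y_pred, y_gt, eps=1e-6):
    """Minimal sketch of the penalty-reduced pixel-wise focal loss in Eq. (9).

    y_pred, y_gt: [B, h', w', n] predicted and ground-truth heatmaps.
    """
    pos = tf.cast(tf.equal(y_gt, 1.0), tf.float32)
    pos_term = pos * tf.square(1.0 - y_pred) * tf.math.log(y_pred + eps)
    neg_term = (1.0 - pos) * tf.square(1.0 - y_gt) * tf.square(y_pred) \
               * tf.math.log(1.0 - y_pred + eps)
    # Sum over all elements and divide by their count, i.e., average the
    # per-image loss of Eq. (9) (normalised by h' * w' * n) over the batch.
    n_norm = tf.cast(tf.size(y_pred), tf.float32)
    return -tf.reduce_sum(pos_term + neg_term) / n_norm
```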

The regression branch is trained using the CIoU loss, as follows:

$${L}_{d,reg}=\sum_{\left[ij\right]=1}{\beta }_{ij}\cdot {L}_{\mathrm{CIoU}}\left({B}_{gt,ij},{B}_{d,ij}\right),$$
(10)

where \(\left[ij\right]=1\) indicates that the corresponding \({B}_{d,ij}\) is assigned to a ground truth. A bounding box \({B}_{d,ij}\) is assigned to a ground truth if some \({G}_{ijn}>0.3\) and \(\sum_{n}{G}_{ijn}-{\mathrm{max}}_{n}{G}_{ijn}<0.3\):

$$\left[ij\right]=\left\{\begin{array}{l}1,\begin{array}{rr}& {\exists }_{n}{G}_{ijn}>0.3,\\ & \sum_{n}{G}_{ijn}-\underset{n}{\mathrm{max}}{G}_{ijn}<0.3,\end{array}\\ 0, \text{otherwise.}\end{array}\right.$$
(11)

Furthermore, for \({B}_{ij}\) with \({\mathrm{max}}_{n}{G}_{ijn}=1\), the regression loss weight \({\beta }_{ij}\) is set to \(2\), and the other weights are set to \(1\). This is done to enhance the precision of the bounding boxes at the center points.
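The sketch below expresses this assignment rule (Eq. (11)) and the center-point weighting in NumPy; the array layout is an assumption.

```python
import numpy as np


def regression_targets(g, tau=0.3):
    """Sketch of the positive-sample rule (Eq. (11)) and the loss weights.

    g: [h', w', N] Gaussian responses G_{ijn} for the N objects of one class.
    Returns a boolean mask of locations assigned to a ground truth and the
    per-location regression weights (2 at exact centers, 1 elsewhere).
    """
    g_max = g.max(axis=-1)
    others = g.sum(axis=-1) - g_max
    assigned = np.logical_and(g_max > tau, others < tau)   # Eq. (11)
    weights = np.where(g_max == 1.0, 2.0, 1.0)             # center points weighted 2
    return assigned, weights
```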

The entire network is fine-tuned from a pretrained detection module. At this training step, two images \({{\varvec{I}}}_{t-i}\) and \({{\varvec{I}}}_{t}\) are taken as inputs simultaneously, where \(i\) lies between \(1\) and \(5\). The loss contains two parts, i.e., the detection loss of \({{\varvec{I}}}_{t-i}\) and the tracking loss between the two images. The tracking loss also comprises two terms, i.e., a regression loss and a classification loss. The tracking ground truth is determined by object ID: \({B}_{t,ij}\) and \({Y}_{t,ij}\) are positive if \(\left[ij\right]\) in \({{\varvec{I}}}_{t-i}\) is equal to \(1\) and the corresponding objects exist in \({{\varvec{I}}}_{t}\). The total training loss is expressed as:

$$Loss={L}_{d,cla}+{L}_{t,cla}+0.1\times \left({L}_{d,reg}+{L}_{t,reg}\right).$$
(12)

3.3 Inference Pipeline

The inference pipeline for joint MOD and MOT is described in Algorithm 1. The inputs of the algorithm are consecutive frames of images \({{\varvec{I}}}_{1}-{{\varvec{I}}}_{t}\). The trajectory \({{\varvec{T}}}_{i}\), confidence \({{\varvec{Y}}}_{i}\), and feature vector \(\left[{{\varvec{V}}}_{i},{{\varvec{Q}}}_{i}\right]\) of all tracks and candidates are recorded in four collections: \(\varvec{\mathcal{T}}\), \(\varvec{\mathcal{O}}\), \(\varvec{\mathcal{Y}}\), and \(\varvec{\mathcal{C}}\). At each time step, object detection is performed on the current frame of image \({\varvec{I}}\), and the existing tracks in \(\varvec{\mathcal{T}}\) and candidates in \(\varvec{\mathcal{C}}\) are tracked. The tracking confidences are used to update all confidences in \(\varvec{\mathcal{Y}}\) and \(\varvec{\mathcal{C}}\) via \({\varvec{Y}}_{i}=\mathrm{min}\left(2\times {\varvec{Y}}_{i}\times {\varvec{Y}}_{t,i},1.5\right)\). The tracks and candidates with a confidence lower than \({p}_{2}\) are deleted, and the remaining trajectories, candidates, and corresponding features are updated. This update strategy gives tracks with high tracking confidence a certain trust margin, as their confidence can exceed \(1\). Detections with an IoU (with existing tracks) greater than \({p}_{3}\), or a confidence less than \({p}_{2}\), are ignored. For the remaining detections, those with a detection confidence greater than \({p}_{1}\) are used to generate new tracks, and the rest are added to the candidate set \(\varvec{\mathcal{C}}\). The entire detection and tracking process can be performed in sparse mode, such that the overall computational complexity of the algorithm is low.
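As a small illustration of the confidence update and the deletion rule in Algorithm 1, consider the Python sketch below; the function boundary is an illustrative assumption, and the default threshold follows Section 4.1 (p2 = 0.3).

```python
def update_confidence(y_prev, y_track, p2=0.3, cap=1.5):
    """Sketch of the track-confidence update and deletion rule in Algorithm 1.

    y_prev is the stored confidence Y_i, y_track the tracking confidence Y_{t,i}.
    Returns the updated confidence and whether the track survives.
    """
    y_new = min(2.0 * y_prev * y_track, cap)   # Y_i = min(2 * Y_i * Y_{t,i}, 1.5)
    keep = y_new >= p2                         # tracks with Y_i < p2 are deleted
    return y_new, keep
```

Candidates are handled in the same way but compared against p1 instead of p2, as in Algorithm 1.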

4 Experiments of the Algorithm

In this section, experiments are carried out to validate the performance of GCNet. Comparison and ablation studies are conducted, and the results indicate the advantages of the proposed method.

4.1 Benchmark and Implementation Details

The experiments of this study are conducted on the vehicle detection and tracking dataset UA-DETRAC, which is captured from a roadside view and can be seen as a typical application of environment perception in ITS. This dataset contains \(100\) sequences; \(60\) are used for training, and the remaining 40 are used for testing. The data in the training and test sets are derived from different traffic scenarios, which makes the test more difficult. The UA-DETRAC benchmark employs AP to rank the performance of the detectors, as well as PR-MOTA, PR-MOTP, PR-MT, PR-ML, PR-IDS, PR-FM, PR-FP, and PR-FN scores for tracking evaluation. The reader is referred to Ref. [32] for further details on the metrics.

Algorithm 1: Inference pipeline of GCNet

Input: continuous frame images \({\varvec{I}}_{1}-{\varvec{I}}_{t}\)

Output: object trajectories \(\varvec{\mathcal{T}}=\left[{\varvec{T}}_{1},{\varvec{T}}_{2},\dots ,{\varvec{T}}_{n}\right]\), \({\varvec{T}}_{i}=\left[{\varvec{B}}_{i,1},{\varvec{B}}_{i,2},\dots ,{\varvec{B}}_{i,t-1}\right]\), where \({\varvec{B}}\) denotes a bounding box

1  Initialize: trajectory set \(\varvec{\mathcal{T}}=\varnothing\); confidence set \(\varvec{\mathcal{Y}}=\varnothing\); feature set \(\varvec{\mathcal{O}}=\varnothing\); candidate set \(\varvec{\mathcal{C}}=\varnothing\); and hyperparameters \({p}_{1}\), \({p}_{2}\), and \({p}_{3}\)
2  for \({\varvec{I}}\) in \({\varvec{I}}_{2}-{\varvec{I}}_{t}\) do
3    \({\varvec{Q}}, {\varvec{K}}, {\varvec{V}}, {\varvec{B}}=DetectionModule\left({\varvec{I}}\right)\);
4    for \({\varvec{T}}_{i}\) in \(\varvec{\mathcal{T}}\) do
5      \({\varvec{B}}_{t,i},{\varvec{Y}}_{t,i}=TrackingModule\left({\varvec{Q}}_{i},{\varvec{K}},{\varvec{V}}_{i}\right)\);
6      Update \({\varvec{Y}}_{i}=\mathrm{min}\left(2\times {\varvec{Y}}_{i}\times {\varvec{Y}}_{t,i}, 1.5\right)\);
7      if \({\varvec{Y}}_{i}<{p}_{2}\) then
8        Delete \({\varvec{T}}_{i}\) from \(\varvec{\mathcal{T}}\);
9      else
10       Add \({\varvec{B}}_{t,i}\) to \({\varvec{T}}_{i}\);
11       Update \({\varvec{Q}}_{i}={\varvec{K}}_{mn}, {\varvec{V}}_{i}={\varvec{V}}_{mn}\),
12       where \(\left(m,n\right)\) is the center of \({\varvec{B}}_{t,i}\);
13     end
14   end
15   for \({\varvec{C}}_{i}=\left[{\varvec{Y}}_{i},{\varvec{Q}}_{i},{\varvec{V}}_{i},{\varvec{B}}_{i}\right]\) in \(\varvec{\mathcal{C}}\) do
16     \({\varvec{B}}_{t,i},{\varvec{Y}}_{t,i}=TrackingModule\left({\varvec{Q}}_{i},{\varvec{K}},{\varvec{V}}_{i}\right)\);
17     Update \({\varvec{Y}}_{i}=\mathrm{min}\left(2\times {\varvec{Y}}_{i}\times {\varvec{Y}}_{t,i}, 1.5\right)\);
18     if \({\varvec{Y}}_{i}<{p}_{1}\) then
19       Delete \({\varvec{C}}_{i}\) from \(\varvec{\mathcal{C}}\);
20     else
21       Add \({\varvec{B}}_{t,i}\) to \({\varvec{T}}_{i}\);
22       Update \({\varvec{Q}}_{i}={\varvec{K}}_{mn}, {\varvec{V}}_{i}={\varvec{V}}_{mn}\),
23       where \(\left(m,n\right)\) is the center of \({\varvec{B}}_{t,i}\);
24     end
25   end
26   for \({\varvec{B}}_{i}\) in \({\varvec{B}}_{d}\) do
27     if \(\exists j,{\text{IoU}}\left({\varvec{B}}_{i},{\varvec{T}}_{j}\right)>{p}_{3}\) then
28       continue;
29     else if \({\varvec{Y}}_{i}>{p}_{1}\) then
30       Add \({\varvec{T}}_{new}=\left[{\varvec{B}}_{i}\right]\) to \(\varvec{\mathcal{T}}\);
31       Add \(\left[{\varvec{Q}}_{i},{\varvec{V}}_{i}\right]\) to \(\varvec{\mathcal{O}}\);
32       Add \({\varvec{Y}}_{i}\) to \(\varvec{\mathcal{Y}}\);
33     else if \({\varvec{Y}}_{i}>{p}_{2}\) then
34       Add \(\left[{\varvec{Y}}_{i},{\varvec{Q}}_{i}, {\varvec{V}}_{i},{\varvec{B}}_{i}\right]\) to \(\varvec{\mathcal{C}}\);
35     end
36   end
37 end

All the experiments are performed using TensorFlow 2.0. The proposed model is trained with Adam on the complete training dataset of UA-DETRAC. The size of the input images is \(512\times 896\). Three commonly used data augmentation methods are employed: random horizontal flip, random brightness adjustment, and scale adjustment. The hyperparameters \({p}_{1}\), \({p}_{2}\), and \({p}_{3}\) for inference are set to \(0.5\), \(0.3\), and \(0.5\), respectively.
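The sketch below illustrates the image side of this augmentation in TensorFlow; the brightness and scale ranges are assumptions, since the paper does not state them, and the corresponding bounding-box transforms are omitted.

```python
import tensorflow as tf


def augment_image(image):
    """Illustrative augmentation pipeline (assumed ranges, image side only).

    Bounding boxes would have to be flipped and scaled consistently;
    that part is omitted here for brevity.
    """
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.2)   # assumed range
    scale = tf.random.uniform([], 0.8, 1.2)                    # assumed range
    new_hw = tf.cast(scale * tf.cast(tf.shape(image)[:2], tf.float32), tf.int32)
    image = tf.image.resize(image, new_hw)
    return tf.image.resize_with_crop_or_pad(image, 512, 896)   # input size 512 x 896
```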

4.2 Ablation Study

In the proposed joint detection and tracking framework, three main components influence the performance: (1) the gate by the confidence map \({{\varvec{Y}}}_{d}\); (2) the concatenated feature vectors from \({\varvec{V}}\) for bounding box regression; and (3) the specially designed position embedding \({\varvec{P}}\). The detection performance of three ablated models is compared with that of the full GCNet to demonstrate the effectiveness of these components. Table 1 shows the results of the comparison. The full version of GCNet exhibits the best performance, with \(74.04\mathrm{\%}\) AP on UA-DETRAC. The gate and the feature vectors from \({\varvec{V}}\) each yield approximately \(2\mathrm{\%}\) AP. The gate step explicitly merges the classification result into the regression branch, which plays the role of spatial attention and is conducive to the training of the regression branch. The concatenated feature vectors from \({\varvec{V}}\) introduce more texture and local information, which is not included in the correlation vectors; this information is beneficial for inferring the size of the objects. To demonstrate the role of the position embedding, it is replaced with a plain explicit position embedding, where \({P}_{ijk}\) equals \(i\) when \(0\le k<c/2\), and equals \(j\) when \(c/2\le k<c\). Notably, the proposed position embedding attains a 5.80% higher AP.

Table 1 Ablation study results

The ablation study is conducted only on the detection benchmark. This is because the tracking module shares most of its parameters with the detection module, and the tracking performance is highly correlated with the detection performance. The results of the ablation study can thus be extended to the tracking module.

4.3 Benchmark Evaluation

Table 2 shows the results obtained on the UA-DETRAC detection benchmark. GCNet demonstrates promising performance and outperforms most detection algorithms on this benchmark. It attains a high AP on the full and medium difficulty levels, as well as on the night and rainy images of the test set. Figure 5 shows the PR curves of GCNet and the other algorithms provided by the UA-DETRAC benchmark. It can be observed that the proposed model is far more effective than the baselines in each scenario. Notably, the proposed model does not employ any other components for better precision, and the backbone network is the original version of ResNet50. Compared with other methods, the performance improvement of GCNet benefits from the global correlation mechanism in the model. In complex traffic scenarios, there are many non-critical areas, such as trees and buildings, as well as many traffic participants with similar appearances. When using correlation convolution for object detection, the correlation between different objects decreases as their distance increases, which effectively reduces false and missed detections. When only the detection module of GCNet is used, it can run at \(36\) frame/s on a single Nvidia 2080Ti.

Table 2 Results on the UA-DETRAC detection benchmark
Figure 5  Precision and recall curves of the detection algorithms

GCNet is designed with both MOD and MOT in mind; this is the main purpose of introducing the global correlation layer to regress absolute coordinates. The tracking results are shown in Table 3. The MOT metrics prefixed with "PR-" evaluate the overall effect of detection and tracking. EB and KIoU are the UA-DETRAC challenge winners. In multi-object tracking, the pixel-coordinate distance of the same target across consecutive frames is generally small. Benefiting from the position embedding and global correlation, the proposed method implicitly encodes the spatiotemporal motion of the tracked targets, which improves the matching accuracy between the trajectories in previous frames and the detections in the current frame. Additionally, a significant PR-MOTA score and an excellent PR-MOTP score are obtained, approximately twice as high as those of the EB and KIoU combination. Moreover, leading scores are obtained in PR-ML and PR-FN on the UA-DETRAC tracking benchmark. Because the detection and tracking modules share most of the features, the computation of the entire joint detection and tracking pipeline is approximately the same as that of detection alone, and it achieves a speed of approximately 34 frame/s.

Table 3 Results on the UA-DETRAC tracking benchmark

5 Conclusions

This paper proposes a novel joint MOD and MOT network, called GCNet. A global correlation layer is introduced to achieve absolute coordinate and size regression, which performs object detection on a single image and naturally propagates the IDs of objects to subsequent consecutive frames. Compared with existing tracking-by-detection methods, GCNet calculates object trajectories end-to-end, without bounding box NMS, data association, or other complex tracking strategies. The proposed method is evaluated on UA-DETRAC, a vehicle detection and tracking dataset. The results of the experiments indicate that:

  (1) The evaluation results demonstrate that the proposed approach outperforms existing methods in both detection and tracking.

  (2) The approach runs at 36 frame/s for detection and 34 frame/s for joint detection and tracking, thereby meeting the real-time requirements of most application scenarios, such as onboard environment perception of autonomous vehicles and roadside perception in ITS.

References

  1. Y Liu, X Guan, P Lu, et al. Research on key issues of consistency analysis of vehicle steering characteristics. Chinese Journal of Mechanical Engineering, 2021, 34: 11.


  2. Q Xu, M Cai, K Li, et al. Coordinated formation control for intelligent and connected vehicles in multiple traffic scenarios. IET Intelligent Transport Systems, 2021, 15(1): 159-173.


  3. Y Luo, D Yang, M Li, et al. Hardware-in-the-loop simulation on dynamical coordinated control method in parallel hybrid electric vehicle (PHEV). Chinese Journal of Mechanical Engineering, 2008, 44(5): 80-85.


  4. M Cai, Q Xu, C Chen, et al. Formation control with lane preference for connected and automated vehicles in multi-lane scenarios. Transportation Research Part C: Emerging Technologies, 2022, 136: 103513.


  5. C Chen, M Cai, J Wang, et al. Cooperation method of connected and automated vehicles at unsignalized intersections: Lane changing and arrival scheduling. IEEE Transactions on Vehicular Technology, 2022, 71(11): 11351-11366.


  6. M Cai, Q Xu, C Chen, et al. Formation control for connected and automated vehicles on multi-lane roads: Relative motion planning and conflict resolution. IET Intelligent Transport Systems, 2023, 17(1): 211-226.


  7. S Ren, K He, R Girshick, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(6): 1137-1149.


  8. T Y Lin, P Goyal, R Girshick, et al. Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision, 2017: 2980-2988.

  9. R Girshick. Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision, 2015: 1440-1448.

  10. X Zhou, D Wang, P Krähenbühl. Tracking objects as points. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020: 474-490.

  11. H Law, J Deng. Cornernet: Detecting objects as paired keypoints. Proceedings of the European Conference on Computer Vision (ECCV), 2018: 734-750.

  12. N Carion, F Massa, G Synnaeve, et al. End-to-end object detection with transformers. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020: 213-229.

  13. A Farhadi, J Redmon. Yolov3: An incremental improvement. Computer Vision and Pattern Recognition, Berlin/Heidelberg, Germany, 2018, 1804: 1-6.


  14. C Y Fu, W Liu, A Ranga, et al. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv: 1701.06659, 2017.

  15. K He, G Gkioxari, P Dollar, et al. Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 42(2): 386-397


  16. S H Rezatofighi, A Milan, Z Zhang, et al. Joint probabilistic data association revisited. Proceedings of the IEEE International Conference on Computer Vision, 2015: 3047-3055.

  17. A Bewley, Z Ge, L Ott, et al. Simple online and realtime tracking. 2016 IEEE International Conference on Image Processing (ICIP), 2016: 3464-3468.


  18. N Wojke, A Bewley, D Paulus. Simple online and realtime tracking with a deep association metric. 2017 IEEE International Conference on Image Processing (ICIP), 2017: 3645-3649.


  19. E Bochinski, V Eiselein, T Sikora. High-speed tracking-by-detection without using image information. 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017: 1-6.

  20. M Ullah, F A Cheikh, A S Imran. Hog based real-time multi-target tracking in bayesian framework. 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2016: 416-422.

  21. E Ristani, C Tomasi. Features for multi-target multi-camera tracking and re-identification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 6036-6046.

  22. X Shi, H Ling, Y Pang, et al. Rank-1 tensor approximation for high-order association in multi-target tracking. International Journal of Computer Vision, 2019, 127(8): 1063-1083.


  23. A Sadeghian, A Alahi, S Savarese. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. Proceedings of the IEEE International Conference on Computer Vision, 2017: 300-311.

  24. C Kim, F Li, A Ciptadi, et al. Multiple hypothesis tracking revisited. Proceedings of the IEEE International Conference on Computer Vision, 2015: 4696-4704.

  25. Y Zhang, C Wang, X Wang, et al. Fairmot: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 2021, 129(11): 3069-3087.


  26. Z Lu, V Rathod, R Votel, et al. Retinatrack: Online single stage joint detection and tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 14668-14678.

  27. P Voigtlaender, M Krause, A Osep, et al. Mots: Multi-object tracking and segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 7942-7951.

  28. Z Wang, L Zheng, Y Liu, et al. Towards real-time multi-object tracking. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020: 107-122.

  29. P Bergmann, T Meinhardt, L Leal-Taixe. Tracking without bells and whistles. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 941-951.

  30. J Peng, C Wang, F Wan, et al. Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020: 145-161.

  31. T Meinhardt, A Kirillov, L Leal-Taixe, et al. Trackformer: Multi-object tracking with transformers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 8844-8854.

  32. L Wen, D Du, Z Cai, et al. UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking. Computer Vision and Image Understanding, 2020, 193: 102907.



Acknowledgements

Not applicable.

Funding

Supported by National Key Research and Development Program of China (Grant No. 2021YFB1600402), National Natural Science Foundation of China (Grant No. 52072212), Dongfeng USharing Technology Co., Ltd., China Intelligent and Connected Vehicles (Beijing) Research Institute Co., Ltd., and “Shuimu Tsinghua Scholarship” of Tsinghua University of China.

Author information


Contributions

QX, KL, KQL, JW, and DC were in charge of the whole trial; XL and MC wrote the manuscript; YG and CZ assisted with the experiments and analysis. All authors read and approved the final manuscript.

Authors’ Information

Qing Xu, received his B.S. and M.S. degrees in automotive engineering from Beihang University, China, in 2006 and 2008 respectively, and the Ph.D. degree in automotive engineering from Beihang University, China, in 2014. During his Ph.D. research, he worked as a visiting scholar with Department of Mechanical Science and Engineering, University of Illinois, Urbana–Champaign, USA. From 2014 to 2016, he had his postdoctoral research at Tsinghua University, China. He is currently working as an assistant research professor with School of Vehicle and Mobility, Tsinghua University, China. His main research interests include decision and control of intelligent vehicles.

Xuewu Lin, received the B.E. degree from Beijing University of Technology, China, in 2018, and the M.S. degree from Tsinghua University, China, in 2021. He is currently an engineer at Horizon Information Technology Co., Ltd., China. His research interests include object detection, multi-object tracking and trajectory prediction.

Mengchi Cai, received his B.E. degree and Ph.D. degree from School of Vehicle and Mobility, Tsinghua University, China, in 2018 and 2023, respectively. He is currently a postdoctoral researcher at Intelligent and Connected Vehicles Lab, School of Vehicle and Mobility, Tsinghua University, China. His research interests include connected and automated vehicles, multi-vehicle formation control, and unsignalized intersection cooperation.

Yu-ang Guo, received the B.E. degree in computer science from Beijing Institute of Technology, China, in 2014. He received the M.S. degree in software engineering from Northwest University, China, in 2018. He is currently a Ph.D. candidate at School of Transportation Science and Engineering, Beihang University, China. His current research interests include road detection, image segmentation and multi-sensor fusion.

Chuang Zhang, received the B.E. degree from Southeast University, China, in 2017, and the M.S. degree from Beihang University, Beijing, China, in 2020. He is currently a Ph.D. candidate in mechanical engineering at School of Vehicle and Mobility, Tsinghua University, China. His research interests include object detection, multi-object tracking and multi-sensor fusion.

Kai Li, is the CEO of Dongfeng USharing Technology Co., Ltd., China. He is now a senior engineer. His research interests include autonomous vehicles, intelligent information interaction, and automotive electronics architecture.

Keqiang Li, received the B.E. degree from Tsinghua University, Beijing, China, in 1985, and the M.S. and Ph.D. degrees in mechanical engineering from Chongqing University, China, in 1988 and 1995, respectively. He is currently a professor at School of Vehicle and Mobility, Tsinghua University, China. His main research areas include automotive control system, driver assistance system, and networked dynamics and control. He is leading National Key Project on CAVs (Intelligent and Connected Vehicles) in China.

Jianqiang Wang, received the B.E. and M.S. degrees from Jilin University of Technology, China, in 1994 and 1997, respectively, and the Ph.D. degree from Jilin University, China, in 2002. He is currently a professor at School of Vehicle and Mobility, Tsinghua University, China. His active research interests include intelligent vehicles, driving assistance systems, and driver behavior.

Dongpu Cao, received the Ph.D. degree from Concordia University, Canada, in 2008. He is the Canada research chair of Driver Cognition and Automated Driving, and an associate professor at University of Waterloo, Canada. His current research focuses on driver cognition, automated driving and cognitive autonomous driving.

Corresponding author

Correspondence to Mengchi Cai.

Ethics declarations

Competing Interests

The authors declare no competing financial interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Xu, Q., Lin, X., Cai, M. et al. End-to-End Joint Multi-Object Detection and Tracking for Intelligent Transportation Systems. Chin. J. Mech. Eng. 36, 138 (2023). https://doi.org/10.1186/s10033-023-00962-x
