MLANet: A Transfer Learning Approach Using Adaptation Network for Multilabel Image Classification in Autonomous Driving
Chinese Journal of Mechanical Engineering volume 34, Article number: 78 (2021)
Abstract
To reduce the discrepancy between the source and target domains, a new multilabel adaptation network (MLANet) based on multiple kernel variants with maximum mean discrepancies is proposed in this paper. The hidden representations of the task-specific layers in MLANet are embedded in the reproducing kernel Hilbert space (RKHS) so that the mean embeddings of specific features in different domains can be precisely matched. Multiple kernel functions are used to improve feature distribution efficiency for explicit mean embedding matching, which can further reduce domain discrepancy. Adverse weather and cross-camera adaptation examinations are conducted to verify the effectiveness of our proposed MLANet. The results show that our proposed MLANet achieves higher accuracies than the compared state-of-the-art methods for multilabel image classification in both the adverse weather adaptation and cross-camera adaptation experiments. These results indicate that MLANet can alleviate the reliance on fully labeled training data and improve the accuracy of multilabel image classification in various domain shift scenarios.
1 Introduction
Benefitting from the rapid development of deep learning technologies in recent years, applications based on convolutional neural networks (CNNs) have been extensively developed for advanced driver assistance systems (ADASs) and autonomous vehicles (AVs) [1,2,3,4]. These applications mainly focus on object detection [5, 6], object tracking [7], and semantic segmentation [8]. Among these applications, image classification is the fundamental technology that divides images into different classes for better detection, tracking, and semantic segmentation performance.
1.1 Image Classification
The methods for image classification can generally be divided into two categories: single-label image classification (SLIC) [9] and multilabel image classification (MLIC) [10]. SLIC assumes that there is only one category of objects in each image. However, naturalistically collected images often contain multiple categories of objects (e.g., vehicles, cyclists, and pedestrians in a single image) in the real world. Therefore, MLIC is needed for safety enhancement of ADASs and AVs, and has attracted more attention in real applications.
Early MLIC algorithms mainly include multilabel k-nearest neighbors (MLKNN) [11], rank support vector machine (RankSVM) [12], and multilabel decision tree (MLDT) [13]. MLKNN [11] uses maximum a posteriori (MAP) estimation to determine the set of labels for test samples based on the traditional KNN algorithm. RankSVM [12] uses a rank loss function and the corresponding marginal function as constraints for multilabel learning based on SVM. MLDT [13] uses the information gain criterion based on multilabel entropy to construct decision trees recursively. These traditional algorithms were extensively applied in MLIC tasks for object detection in the late 1990s to early 2000s. However, since these methods have high computational complexity and low accuracy in predicting rare categories when the sample numbers are imbalanced, their performance was generally unsatisfactory in terms of classification accuracy.
More recently, deep learning technologies have been applied to MLIC tasks, and significant classification improvements have been achieved. The powerful nonlinear representation capabilities of deep neural networks can learn more effective features from large-scale datasets for better performance. A hypotheses-CNN-pooling (HCP) algorithm was proposed in Ref. [14] based on binarized normed gradients (BING) that used cross-hypothesis max-pooling to fuse the classification results of all candidate regions to obtain images with complete label information. The results showed that better performance was achieved by fusing all candidate regions. Wang et al. [15] used CNN to extract features of the input images and used recurrent neural networks (RNNs) to capture the label dependency. The results showed that the combined CNN-RNN method could effectively identify image classes by modeling the label co-occurrence dependency in a joint image/label embedding space. Besides, Zhang et al. [16] combined CNN to predict small object classes in images, and Song et al. [17] used a deep multimodal CNN for multi-instance multilabel image classification. Results from both studies supported the effectiveness of their algorithms in the examined MLIC tasks.
An effective deep neural network model requires a large number of accurately labeled samples for training, because millions or even billions of parameters need to be learned from the labeled samples. However, reliable labels rely heavily on extensive human labor [18, 19]. This time-consuming and labor-intensive labeling work hinders the rapid and widespread use of these technologies in practical applications. Transfer learning was developed to alleviate this problem [20].
1.2 Transfer Learning
Transfer learning is an effective technique for improving classifier performance in the target domain when annotated data are available only in the source domain. In this setting it is also referred to as unsupervised domain adaptation, which adapts features from labeled source domains to unlabeled target domains and thereby greatly reduces the cost of human labeling work [20].
Pan et al. [21] proposed a transfer component analysis (TCA) algorithm. In the subspace of transfer components, the feature distribution discrepancy between the source domain and the target domain was significantly reduced and the separability of the data was retained. Long et al. [22] simultaneously adapted both the marginal and conditional distributions in a principled dimensionality reduction procedure by using joint distribution adaptation (JDA). Their results demonstrated the superiority of JDA in accuracy and efficiency over the compared deep learning and transfer learning methods in classification tasks. Wang et al. [23] developed a balanced distribution adaptation (BDA) method and added a balance factor to dynamically weigh the importance of the marginal distribution and the conditional distribution to improve the classification accuracy. In Ref. [24], an easy transfer learning (EasyTL) approach was proposed to learn nonparametric transfer features by exploiting intra-domain structures to obtain an image classifier. It was concluded that EasyTL had high computational efficiency and could be directly applied in image classification technologies on resource-constrained devices such as wearables.
To address the time-consuming and labor-intensive limitations of deep learning algorithms in MLIC applications, transfer learning has been introduced to improve the CNN training process and the accuracy in MLIC tasks. Yosinski et al. [25] quantitatively analyzed the features from the CNN encoding process, and found that the encoded features were more effective in transfer learning. Based on their experimental findings, Tajbakhsh et al. [26] concluded that when training a deep CNN model in the target domain, it was better to fine-tune a pre-trained source domain CNN model than to retrain the model in the target domain. Zhang et al. [27] proposed a deep transfer network (DTN) framework which used deep neural networks for cross-domain feature distribution matching. The effectiveness of the algorithm was validated in cross-domain multi-class object recognition tasks. Tzeng et al. [28] analyzed the loss of domain confusion and proposed a deep domain confusion (DDC) algorithm to optimize the objective function of maximizing consistency between the source domain and target domain. The experimental results showed that the learned representations were invariant to domain shifts and thus could be used for MLIC tasks.
1.3 Contributions
Although the above-mentioned MLIC algorithms have achieved significant progress, image classification in complex traffic environments (e.g., hazy or snowy weather) based on camera systems is still a challenging task for the development of ADASs and AVs, because the generalization capability of the algorithms in real traffic still needs to be improved and the algorithms are prone to fail in cross-domain adaptation. To solve this problem and meet the requirement of high accuracy for practical applications, we proposed an effective deep adaptive neural network method for MLIC tasks, namely, the multilabel adaptation network (MLANet). Specifically, MLANet leverages transfer learning to transfer knowledge from a well-labeled domain to a similar but different domain with limited or no labels. To effectively use the labeled data in the source domain, we conducted supervised MLIC learning on the source domain data, and used multiple kernel variants of maximum mean discrepancies to align the feature distributions of the source and target domains and thereby reduce domain discrepancy. The main contributions of this paper can be summarized as follows.

(1) We proposed a new deep adaptation network, MLANet, to learn transferable features for adapting models from a source domain (with labeled information) to a different target domain (without labeled information) in MLIC tasks.

(2) The effectiveness of our proposed MLANet in various traffic environments has been demonstrated by extensive experiments on three large-scale driving datasets. This suggests that, when applied in ADASs and AVs, our proposed MLANet could make them adaptable to all-around-the-clock illumination in various weather conditions.

(3) Our proposed MLANet alleviates the reliance on fully labeled training data, and therefore no extensive labor work will be needed for network development. This would promote the development efficiency of ADASs and AVs.
2 Proposed Approach
The proposed MLIC approach (i.e., MLANet) mainly consists of two subnetworks: the multilabel learning network (MLNet) and the adaptation network (ANet). MLNet uses labeled samples from the source domain to train a multilabel classifier for simultaneous prediction of multiple labels of an image. ANet embeds the features from the task-specific layer into the reproducing kernel Hilbert space (RKHS) and matches different distributions optimally using the multiple kernel maximum mean discrepancy (MKMMD) in RKHS. See Figure 1 for the overall framework of our proposed MLANet. A detailed description of the proposed method is given in the following subsections.
2.1 MLNet (Multilabel Learning Network)
Multilabel learning means that each image is associated with multiple class labels simultaneously. Assume that the training set images can be described as \(I = \{ x_{i} \}\), where \(x_{i}\) represents image i and its corresponding label vector is \(y_{i} \in \{ 0,1\}^{c}\). \(y_{i}^{j} = 1\) indicates that the jth label exists in image \(x_{i}\), while \(y_{i}^{j} = 0\) indicates the absence of the jth label in image \(x_{i}\). The MLIC task is essentially about learning a mapping function \(f:x \to y\) from the training set \(\{ (x_{i} ,y_{i} ) \mid 1 \le i \le n\}\). In this paper, we considered the MLIC problem as multiple binary classification problems, which means that for each label the samples carrying that label were considered as positive samples (i.e., \(y_{i}^{j} = 1\)), while the others were considered as negative samples (i.e., \(y_{i}^{j} = 0\)).
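As a small illustration of this label representation, the sketch below encodes the set of classes present in an image as a binary vector. The class names and the helper `to_multi_hot` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

# Hypothetical class list, used only for illustration.
CLASSES = ["pedestrian", "vehicle", "two-wheeler"]

def to_multi_hot(labels, classes=CLASSES):
    """Return a 0/1 vector y with y[j] = 1 if class j is present in the image."""
    y = np.zeros(len(classes), dtype=np.int64)
    for name in labels:
        y[classes.index(name)] = 1
    return y

# One image containing a pedestrian and a vehicle but no two-wheeler.
y_example = to_multi_hot(["pedestrian", "vehicle"])
```

Each column of such vectors then defines one binary classification problem.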
MLNet trains multilabel classifiers based on labeled data samples from the source domain. Specifically, an image with a size of 224×224×3 was fed into the MLNet, and a feature map was extracted through ResNet50. As shown in Figure 1, the dimension of the feature vector was reduced from 4096 to 2048 by the first fully connected layer (FCL1), from 2048 to 256 by FCL2, and from 256 to the number of examined labels by FCL3. The number of parameters in MLNet is approximately 24 million, which indicates that it is difficult to learn the large number of parameters directly from the source domain. Therefore, we transferred the ResNet50 model pre-trained on the ImageNet dataset to the MLNet. We trained the three fully connected layers and fine-tuned the other layers. Finally, we used the sigmoid function to calculate the score for each category and used the binary cross-entropy loss as the multilabel classification loss function. For each minibatch, we calculated the loss using the following formulas:
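Assuming the standard binary cross-entropy and sigmoid expressions, and using the symbols defined next, these formulas can be written as:

\[ J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_{i} \log h_{\theta}(x_{i}) + (1 - y_{i}) \log \left( 1 - h_{\theta}(x_{i}) \right) \right] \tag{1} \]

\[ h_{\theta}(x_{i}) = \frac{1}{1 + e^{-\hat{y}_{i}}} \tag{2} \]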
where N is the number of training samples, \(h_{\theta } (x_{i} )\) denotes the probability of the ith class calculated by the sigmoid function, \(\hat{y}_{i}\) denotes the value predicted by MLNet for the ith class, \(y_{i}\) is the ground truth of the ith class, and \(y_{i} \in \{ 0,1\}\).
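A minimal NumPy sketch of this loss is given below. It assumes the usual sigmoid and binary cross-entropy definitions, averages over every (sample, class) entry of a minibatch, and adds a small clipping constant `eps` for numerical safety; the function names and `eps` are assumptions of this sketch.

```python
import numpy as np

def sigmoid(scores):
    """Map raw class scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-scores))

def multilabel_bce(scores, targets, eps=1e-12):
    """Mean binary cross-entropy over all (sample, class) entries.

    scores  : raw network outputs, shape (batch, classes)
    targets : 0/1 ground-truth labels, same shape
    eps     : clipping constant (an assumption here) that avoids log(0)
    """
    probs = np.clip(sigmoid(scores), eps, 1.0 - eps)
    losses = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    return float(losses.mean())

# Zero scores give probability 0.5 for every class, so the loss equals log(2).
loss_at_zero = multilabel_bce(np.zeros((2, 3)), np.array([[1, 0, 1], [0, 0, 1]]))
```

Treating every class independently in this way is what lets a single network output several labels per image.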
2.2 ANet (Adaptation Network)
In deep neural networks, the shallow layers learn general features so that the parameters of shallow layers are universal across different tasks, while the parameters of deep layers depend on specific tasks [25]. This motivated us to focus our proposed network on the deep task-specific layers. Therefore, we proposed an adaptation network (ANet) to explore the transferability from one domain with labeled information to another domain without labeled information by embedding the MKMMD loss in the last layers. In transfer learning, the domain with labeled information is treated as the source domain, and the domain without labeled information is considered as the target domain. The data from these two domains usually follow different probability distributions.
In this paper, to align different data distributions in the two domains, we introduced an RKHS where the domain discrepancy was measured by using multiple kernel variants of MMD proposed by Gretton et al. [29]. Specifically, ANet learns transferable features by using MKMMD to embed the deep features of FCL2 into the RKHS, which can optimally match the source and target domain distributions. Figure 2 gives an intuitive example of domain adaptation using multiple Gaussian kernels. For biased datasets (left), a classifier learned from a source domain cannot transfer well to a target domain. By mapping the samples from the source domain and the target domain to the RKHS (right), discriminative and domain-invariant representations can be learned.
Assuming that \(X^{s} = \{ x_{1}^{s} ,x_{2}^{s} ,...,x_{n}^{s} \}\) consists of n samples with the labeled information \(Y^{s} = \{ y_{1}^{s} ,y_{2}^{s} ,...,y_{n}^{s} \}\) in the source domain, and \(X^{t} = \{ x_{1}^{t} ,x_{2}^{t} ,...,x_{m}^{t} \}\) consists of m samples in the target domain without labels, the source domain and the target domain can be described as \(D_{s} = \{ (X^{s} ,Y^{s} )\}\) and \(D_{t} = \{ (X^{t} )\}\), respectively. The probability distributions of the source and target domains embedded in the RKHS are denoted as p and q, respectively. The MKMMD \(d_{k} (p,q)\) is defined as the distance between the mean embeddings of the probability distributions p and q in the RKHS. Hence, the squared formula of MKMMD can be described as follows:
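In the usual notation for this distance, and consistent with the symbols defined next, the squared MKMMD can be written as:

\[ d_{k}^{2}(p,q) \triangleq \left\| E_{p}[\phi(x^{s})] - E_{q}[\phi(x^{t})] \right\|_{H_{k}}^{2} \tag{3} \]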
where \(H_{k}\) is an RKHS with a characteristic kernel k, \(E_{p} [ \bullet ]\) is the mean of p, \(E_{q} [ \bullet ]\) is the mean of q, and \(\phi ( \bullet )\) is a feature mapping function which maps the features from the original feature space to the RKHS. In MKMMD, we denoted K as a particular family of kernels. Hence,
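Assuming the common convex-combination construction, with M (a symbol introduced here) denoting the number of base kernels, this family can be written as:

\[ K \triangleq \left\{ k = \sum_{u=1}^{M} \beta_{u} k_{u} \; : \; \sum_{u=1}^{M} \beta_{u} = 1, \; \beta_{u} \ge 0, \; \forall u \right\} \tag{4} \]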
where \(\{ k_{u} \}\) is the set of positive definite kernel functions indexed by u, and the constraints on the coefficients \(\{ \beta_{u} \}\) are imposed to guarantee that the derived kernel k is characteristic.
Given \(x^{s}\) and \(x^{{s^{\prime}}}\) as independent random variables with distribution p, and \(x^{t}\) and \(x^{{t^{\prime}}}\) as independent random variables with distribution q, the characteristic kernel \(k( \bullet )\) is defined as \(k(x^{s} ,x^{t} ) = \left\langle {\phi (x^{s} ),\phi (x^{t} )} \right\rangle\). Hence, the distance between the means of probability distributions p and q can be computed as the expectation of kernel functions:
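Expanding the squared RKHS norm with this kernel identity gives the standard expression:

\[ d_{k}^{2}(p,q) = E_{x^{s},x^{s^{\prime}}}\left[ k(x^{s},x^{s^{\prime}}) \right] + E_{x^{t},x^{t^{\prime}}}\left[ k(x^{t},x^{t^{\prime}}) \right] - 2E_{x^{s},x^{t}}\left[ k(x^{s},x^{t}) \right] \tag{5} \]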
where \(x^{{s^{\prime}}}\) is an independent copy of \(x^{s}\) with the same distribution, and \(x^{{t^{\prime}}}\) is an independent copy of \(x^{t}\) with the same distribution, \(E_{{x^{s} ,x^{t} }} [ \bullet ]\) is the mean of \(k(x^{s} ,x^{t} )\). \(E_{{x^{s} ,x^{{s^{\prime}}} }} [ \bullet ]\) and \(E_{{x^{t} ,x^{{t^{\prime}}} }} [ \bullet ]\) are similarly defined.
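The expectation form described above can be estimated from samples by replacing each expectation with an average over all pairs. The sketch below does this for a convex combination of Gaussian kernels; the bandwidths, the equal kernel weights, and the function names are assumptions made for illustration, and this plug-in estimate is the simple quadratic-time version rather than the estimator used for training.

```python
import numpy as np

def multi_gaussian_kernel(a, b, gammas, betas):
    """Weighted sum of Gaussian kernels exp(-||a_i - b_j||^2 / gamma_u) for all pairs."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return sum(beta * np.exp(-sq_dists / gamma) for beta, gamma in zip(betas, gammas))

def mk_mmd2_quadratic(xs, xt, gammas=(1.0, 2.0, 4.0), betas=None):
    """Plug-in (biased) estimate of the squared MKMMD using all sample pairs."""
    if betas is None:
        betas = np.full(len(gammas), 1.0 / len(gammas))  # equal weights (assumption)
    k_ss = multi_gaussian_kernel(xs, xs, gammas, betas).mean()
    k_tt = multi_gaussian_kernel(xt, xt, gammas, betas).mean()
    k_st = multi_gaussian_kernel(xs, xt, gammas, betas).mean()
    return float(k_ss + k_tt - 2.0 * k_st)

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(64, 4))
target_same = rng.normal(0.0, 1.0, size=(64, 4))      # same distribution as source
target_shifted = rng.normal(2.0, 1.0, size=(64, 4))   # mean-shifted distribution

mmd_same = mk_mmd2_quadratic(source, target_same)
mmd_shifted = mk_mmd2_quadratic(source, target_shifted)
```

With these synthetic features, the estimate for the shifted target should come out larger than for the matching target, which is the behavior a domain discrepancy measure needs.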
The purpose of ANet is to minimize the domain discrepancy between the source and target domains. The domain discrepancy can be measured by the distance between the means of the probability distributions from the source and target domains. Therefore, we have:
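One form consistent with these definitions, assuming the discrepancy is taken to be the squared MKMMD of the adapted layer features and is the quantity to be minimized, is:

\[ D_{F}(D_{s},D_{t}) = d_{k}^{2}(p,q) \tag{6} \]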
where \(D_{s}\) and \(D_{t}\) denote the source and target domains, respectively. \(D_{F} ( \bullet )\) denotes the domain discrepancy between the source and target domains in the last fully connected layer of ANet.
2.3 MLANet
The loss of MLANet consists of the MKMMD loss and the multilabel classification loss. The objective of minimizing the multilabel classification loss is to improve the distinguishability of features in the source domain, while the goal of minimizing MKMMD loss is to reduce the discrepancy between the means of the probability distributions in the source and target domains. Thus, we derived the loss function of MLANet as:
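With the symbols defined next, this combined objective takes the form:

\[ L = J(\theta) + \lambda D_{F}(D_{s},D_{t}) \tag{7} \]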
where \(\lambda\) is the loss weight (a hyperparameter), \(D_{F}\) denotes the MKMMD loss, \(J(\theta )\) represents the multilabel classification loss of the source domain, and L is the total loss of the whole MLANet, which is minimized on the training samples by the minibatch stochastic gradient descent (SGD) algorithm.
Minibatch SGD is important for training deep networks effectively, but computing MKMMD requires pairwise similarities between samples, which leads to a computational complexity of \(O(n^{2} )\). To solve this problem, we used an unbiased empirical estimate of MKMMD proposed by Gretton et al. [29], which can be computed with a complexity of \(O(n)\). We used the unbiased empirical estimate to calculate the squared form of MKMMD as follows:
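In the standard linear-time form of this estimator, which is assumed here, consecutive samples are grouped into quad-tuples:

\[ d_{k}^{2}(p,q) = \frac{2}{n_{s}} \sum_{i=1}^{n_{s}/2} g_{k}(z_{i}) \tag{8} \]

\[ g_{k}(z_{i}) \triangleq k(x_{2i-1}^{s},x_{2i}^{s}) + k(x_{2i-1}^{t},x_{2i}^{t}) - k(x_{2i-1}^{s},x_{2i}^{t}) - k(x_{2i}^{s},x_{2i-1}^{t}) \]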
where \(n_{s}\) denotes the number of variables \(x^{s}\), and \(z_{i}\) is the quad-tuple denoted as \(z_{i} \triangleq (x_{2i-1}^{s} ,x_{2i}^{s} ,x_{2i-1}^{t} ,x_{2i}^{t} )\).
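The quad-tuple construction can be sketched in NumPy as follows. This is an illustrative version of the linear-time estimate under assumed Gaussian bandwidths and equal kernel weights; `gaussian_mix` and `mk_mmd2_linear` are names chosen for this sketch.

```python
import numpy as np

def gaussian_mix(a, b, gammas, betas):
    """Row-wise multi-kernel value: sum_u beta_u * exp(-||a_i - b_i||^2 / gamma_u)."""
    sq = ((a - b) ** 2).sum(axis=1)
    return sum(beta * np.exp(-sq / gamma) for beta, gamma in zip(betas, gammas))

def mk_mmd2_linear(xs, xt, gammas=(1.0, 2.0, 4.0), betas=(1 / 3, 1 / 3, 1 / 3)):
    """Linear-time estimate of the squared MKMMD built from quad-tuples."""
    n = (min(len(xs), len(xt)) // 2) * 2       # use an even number of samples
    s_odd, s_even = xs[0:n:2], xs[1:n:2]       # x^s_{2i-1} and x^s_{2i}
    t_odd, t_even = xt[0:n:2], xt[1:n:2]       # x^t_{2i-1} and x^t_{2i}
    g = (gaussian_mix(s_odd, s_even, gammas, betas)
         + gaussian_mix(t_odd, t_even, gammas, betas)
         - gaussian_mix(s_odd, t_even, gammas, betas)
         - gaussian_mix(s_even, t_odd, gammas, betas))
    return float(g.mean())                     # equals (2 / n) * sum over n / 2 quad-tuples
```

Each sample is touched a constant number of times, so the cost grows linearly with the batch size, at the price of a noisier estimate than the all-pairs version.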
When we train a deep CNN by minibatch SGD, we only need to consider the gradient of Eq. (8) with respect to each data point \(x_{i}\). To perform a minibatch update, we computed the gradient of Eq. (7) with respect to the lth layer parameters \(\theta^{l}\) as:
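Following the usual treatment of this estimator, and assuming \(J(z_{i})\) denotes the classification loss evaluated on the labeled source points in \(z_{i}\), the per-quad-tuple gradient can be written as:

\[ \nabla_{\theta^{l}} = \frac{\partial J(z_{i})}{\partial \theta^{l}} + \lambda \frac{\partial g_{k}(z_{i}^{l})}{\partial \theta^{l}} \tag{9} \]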
Given that a kernel k is a linear combination of multiple Gaussian kernels \(\{ k_{u}(x_{i} ,x_{j} ) = \exp ( - \left\| {x_{i} - x_{j} } \right\|^{2} /\gamma_{u} )\}\), the gradient \(\partial g_{k} (z_{i}^{l} )/\partial \theta^{l}\) can be easily calculated by using the chain rule. For instance, the gradient of \(k(x_{2i-1}^{sl} ,x_{2i}^{tl} )\) in \(g_{k} (z_{i}^{l} )\) can be calculated as:
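For a single Gaussian kernel with bandwidth \(\gamma_{u}\) and the affine layer defined next (any activation function is ignored, which is an assumption of this sketch), the chain rule gives:

\[ \frac{\partial k(x_{2i-1}^{sl},x_{2i}^{tl})}{\partial W^{l}} = -\frac{2}{\gamma_{u}} \, k(x_{2i-1}^{sl},x_{2i}^{tl}) \left( x_{2i-1}^{sl} - x_{2i}^{tl} \right) \left( x_{2i-1}^{s(l-1)} - x_{2i}^{t(l-1)} \right)^{\mathsf{T}} \tag{10} \]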
where \(x_{i}^{l} = W^{l} x_{i}^{l-1} + b^{l}\). \(W^{l}\) and \(b^{l}\) represent the coefficient matrix and the bias term from the \((l-1)\)th layer to the lth layer, respectively. In summary, the training process of the entire MLANet approach can be described in Algorithm 1.
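The chain-rule gradient of a Gaussian kernel through such an affine layer can be checked numerically. The sketch below compares an analytic gradient with central finite differences for one source input `u` and one target input `v`; the layer sizes, bandwidth, and random values are arbitrary assumptions, and no activation function is modeled.

```python
import numpy as np

def kernel_after_layer(W, b, u, v, gamma):
    """Gaussian kernel between the layer outputs W @ u + b and W @ v + b."""
    diff = (W @ u + b) - (W @ v + b)
    return np.exp(-(diff @ diff) / gamma)

def kernel_grad_W(W, b, u, v, gamma):
    """Chain-rule gradient of the kernel value with respect to W."""
    diff = W @ (u - v)                      # the bias cancels in the difference
    k = np.exp(-(diff @ diff) / gamma)
    return -(2.0 / gamma) * k * np.outer(diff, u - v)

rng = np.random.default_rng(1)
W = 0.3 * rng.normal(size=(3, 4))
b = rng.normal(size=3)
u = rng.normal(size=4)
v = rng.normal(size=4)
gamma = 2.0

# Central finite differences as an independent check of the analytic gradient.
numeric = np.zeros_like(W)
step = 1e-6
for r in range(W.shape[0]):
    for c in range(W.shape[1]):
        bump = np.zeros_like(W)
        bump[r, c] = step
        numeric[r, c] = (kernel_after_layer(W + bump, b, u, v, gamma)
                         - kernel_after_layer(W - bump, b, u, v, gamma)) / (2 * step)

max_error = np.abs(numeric - kernel_grad_W(W, b, u, v, gamma)).max()
```

In practice an automatic differentiation framework would compute these gradients, so a check like this is mainly useful for validating a hand-written loss.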
3 Datasets and Experiment
Two adaptation examinations were conducted to verify the effectiveness of our proposed MLANet, i.e., adverse weather adaptation and cross-camera adaptation. Whether a detection system can operate faithfully in different weather conditions is essential for a safe autonomous driving system [30]. This paper mainly addressed the domain shift caused by the conversion between clear weather and hazy weather in the adverse weather adaptation experiment. In the cross-camera adaptation experiment, we examined the effects on alleviating the data bias caused by the different resolution and contrast of color cameras under similar weather conditions. The datasets used in our experiment are described in Section 3.1, and the experimental setup is described in Section 3.2.
3.1 Datasets
Naturalistic driving datasets are very important for the development of autonomous driving technologies [31,32,33]. Three naturalistic driving datasets were used to train and evaluate our proposed MLANet: Cityscapes [34], Foggy Cityscapes [30], and KITTI [35]. The Cityscapes dataset is an urban scene dataset for driving scenarios. The Foggy Cityscapes dataset is a synthetic foggy dataset derived from Cityscapes for semantic foggy scene understanding. The KITTI dataset consists of images collected from real driving in mid-size cities. Though these three datasets cover various urban scenes, the images vary in style, resolution, and illumination between datasets. The main domain divergence between Foggy Cityscapes and Cityscapes is the synthetic fog effect in Foggy Cityscapes, while KITTI differs noticeably from Cityscapes in image resolution, illumination, and urban scenes. Figure 3 illustrates the visual differences between Cityscapes, Foggy Cityscapes, and KITTI. In this paper, we used C to represent Cityscapes, F to represent Foggy Cityscapes, and K to represent KITTI. Therefore, the transfer from Cityscapes to Foggy Cityscapes can be denoted as C→F, and similar expressions can be obtained for the other transfer patterns. In our experiments, we conducted four transfer tasks: C→F, F→C, C→K, and K→C.
3.2 Experiment
The experimental training data consist of source training data with images and their category annotations, and target training data with only images. We extracted three classes (i.e., pedestrians, vehicles, and two-wheelers) from the three datasets for experiments. Table 1 lists the number of sample images for each class in the three employed datasets.
Due to the insufficient number of images in the datasets for reliable training, we randomly applied five image augmentation techniques to expand the dataset: rotation, shift, contrast, scaling, and horizontal flipping [36, 37]. All images in the experiment were resized to 224 × 224 × 3, and we initialized the model with weights pre-trained on ImageNet. Each batch included 32 images from each of the source and target domains. We used an optimizer with a momentum of 0.9 and a weight decay of 0.001 in the experiment [38]. Table 2 lists the hyperparameters used for the training of our MLANet. All experiments were run on an Intel i5-9600KF CPU (3.70 GHz) with an NVIDIA GeForce RTX 2070 GPU.
To validate the effectiveness of our proposed MLANet, MLKNN [11] and five state-of-the-art transfer learning methods (i.e., TCA [21], JDA [22], BDA [23], DDC [28], and DAN [39]) were selected for comparison. All these comparison methods were selected because they have achieved promising MLIC performances with detailed recommended parameters in Refs. [11,21,22,23,24, 28, 39]. In the experiment, we followed Refs. [22, 23, 28, 39] and used the classification accuracy to evaluate the effectiveness of our method in reducing the divergence between the source and target domains. The results obtained with the different methods are shown in the following section.
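The text names classification accuracy as the metric without spelling out its computation. As an illustrative assumption, the sketch below thresholds predicted probabilities at 0.5, computes the fraction of correct presence/absence decisions for each class, and averages over classes; the threshold and the toy numbers are not taken from the paper.

```python
import numpy as np

def per_class_accuracy(probs, targets, threshold=0.5):
    """Fraction of images with a correct presence/absence decision, per class."""
    preds = (probs >= threshold).astype(np.int64)
    return (preds == targets).mean(axis=0)

# Toy predictions for four images and three classes.
probs = np.array([[0.9, 0.2, 0.7],
                  [0.4, 0.8, 0.1],
                  [0.6, 0.3, 0.2],
                  [0.1, 0.9, 0.6]])
targets = np.array([[1, 0, 1],
                    [0, 1, 0],
                    [1, 1, 0],
                    [0, 1, 1]])

class_acc = per_class_accuracy(probs, targets)   # one value per class
average_acc = float(class_acc.mean())            # average over the classes
```

In this toy case only the second class has one wrong decision (third image), so its accuracy is 0.75 while the other two classes reach 1.0.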
4 Results and Discussion
4.1 Adverse Weather Adaptation
Figure 4(a) and Figure 4(b) show intuitive examples of MLANet for MLIC tasks in hazy and clear weather, respectively. The experimental results of domain shift between clear weather and hazy weather are presented in Table 3. The results show that our MLANet achieves average accuracies of 94.83% and 96.85% in the transfer tasks C→F and F→C, respectively, better than the compared transfer learning methods. The classification performance for each object class is also superior to that of the other compared methods. The results indicate that our proposed MLANet can effectively transfer knowledge from a clearly labeled domain to a similar but different domain with limited or no labels. The advantage of MLANet probably arises because the adaptation network effectively reduces the distribution discrepancy between the source and target domains caused by weather changes.
To better quantify the performance of our proposed MLANet, the average accuracies of TCA, JDA, BDA, DDC, DAN, and our MLANet for the transfer tasks C→F and F→C with respect to epoch numbers are shown in Figure 5(a) and Figure 5(b), respectively. Here, C→F denotes the transfer task from Cityscapes to Foggy Cityscapes, and F→C denotes the transfer task from Foggy Cityscapes to Cityscapes. The illustrated results show that: a) TCA has the lowest accuracy because it only adapts the marginal distribution and does not need iteration. b) The convergence speed and accuracy of MLANet are substantially higher than those of DDC, indicating that single-kernel MMD cannot sufficiently align the probability distributions of the source and target domains. c) The MLANet curve lies above that of DAN overall, which demonstrates that MLANet has better MLIC performance and can further reduce the divergence between the source domain and the target domain. d) MLANet can achieve a promising accuracy with a low epoch number.
4.2 CrossCamera Adaptation
The camera mechanisms and underlying settings can also lead to domain shift, such as substantial differences in visual appearance and image quality. Therefore, cross-camera adaptation is also an important and effective indicator for measuring the quality of transfer learning. Intuitive examples of MLANet for MLIC in cross-camera adaptation can be found in Figure 4(b) and Figure 4(c). The quantitative experimental results of cross-camera adaptation are shown in Table 4. Specifically, the average MLIC accuracies of MLANet are 78.67% and 70.10% in the transfer tasks C→K and K→C, respectively. These numbers are 2.2% and 1.03% higher than those of the best comparison method, DAN, in the same two transfer tasks. The presented results indicate that MLANet is superior to the other comparison methods, showing more powerful adaptability.
Besides, the pedestrian and vehicle classification accuracies of MLANet in task K→C are substantially lower than those in task C→K, which is reasonable as the number of K samples is smaller than that of C, especially the number of vehicles. However, the classification performance of our MLANet on two-wheelers in K→C does not show a similar trend, and the classification accuracies of two-wheelers are all lower than the accuracies of pedestrians or vehicles in both C→K and K→C. This is probably because two-wheelers always appear with bikes or motorcycles, which increases the noise in feature learning, and because the adaptability differs between transfer tasks. The classification accuracies of the examined methods with respect to the number of epochs in the transfer tasks C→K and K→C show similar trends to the results illustrated in Figure 5, indicating the superiority of MLANet in cross-camera adaptation. In summary, the results show that MLANet is effective for the domain shift caused by different cameras.
The loss weight parameter λ may also influence the performance of MLANet. Figure 6 shows the effect of parameter λ on MLANet and DDC performance in the C→K and F→C tasks. The other three methods do not include parameter λ, thus only DDC and MLANet are compared in Figure 6. Here, F→C and C→K denote the transfer from Foggy Cityscapes to Cityscapes and from Cityscapes to KITTI, respectively. The experimental results show that the classification accuracy of MLANet clearly exceeds that of DDC for any λ in either task, supporting the advantage of our proposed MLANet. The curves of MLANet are bell-shaped, first rising and then falling as λ increases. This trend is reasonable because the network focuses more on the MKMMD loss when λ initially increases, resulting in improved transferability and increasing accuracy of MLANet. However, when λ is too large, the training of the network ignores the classification loss, which causes the accuracy of the network to decrease. The results illustrated in Figure 6 show that the best performance of MLANet is achieved when λ = 1.0, which is the best trade-off for transferability enhancement in MLANet.
4.3 Discussion on the Novelties of Our Proposed Method
Different from all the previous methods, the novelties of our proposed method include: (1) The structure of our proposed network is different from the previous ones. Compared with other methods such as DAN, the combination of identity blocks (i.e., the Id_B module in Figure 2), convolutional blocks (i.e., the Conv_B module in Figure 2), and MKMMD is newly developed to help accelerate the convergence rate of the model and to improve the image classification accuracy and adaptation capability of the network. (2) MKMMD is innovatively used to simultaneously align the data distribution of multiple labels to enhance the generalization of multilabel image classifiers for ADASs and AVs, which extends the previous work concerning MKMMD from single-label image classification to multilabel image classification. The results presented above show that our method outperforms the others, both qualitatively and quantitatively, on the different urban cross-scenes, which demonstrates that MLANet has a better adaptive ability to effectively alleviate the domain gap.
Other reasons why our method can utilize unlabeled auxiliary data to improve the generalization of the network are that: (1) the generated feature maps are mapped to the RKHS, and (2) the distributions of data in the source and target domains are aligned during network training. Therefore, our well-trained model can effectively perform MLIC in the source and target domains, which indicates that our MLANet has a strong generalization capability for multilabel image classification in different urban scenes.
This paper mainly focuses on moving objects, including pedestrians, vehicles, and two-wheelers. Static objects such as traffic signs and traffic lights are usually much smaller in images than the mentioned moving objects [40, 41], which challenges the performance of the related methods in the literature. In our future studies, we will focus on developing innovative methods to classify the other objects in various scenarios for traffic safety improvement [42]. Meanwhile, adaptive object detection and tracking in various weather and illumination scenarios based on the knowledge transferred from daytime dry weather scenarios will also be considered.
5 Conclusions

(1) To obtain a robust multilabel classifier, a novel and effective method (MLANet) is proposed for cross-domain MLIC. The proposed MLANet consists of two different subnetworks, MLNet and ANet, for multilabel learning and transfer learning, respectively.

(2) In adverse weather adaptation, MLANet achieves average accuracies of 94.83% and 96.85% in the transfer tasks C→F and F→C, respectively. The accuracy of MLANet could even be better than that of the compared methods with a low epoch number.

(3) In cross-camera adaptation, the average accuracy of MLANet is 2.2% and 1.03% higher than that of the best comparison method in the C→K and K→C transfer tasks, respectively, showing its better adaptability.

(4) The sensitivity analysis of the loss weight parameter λ shows that a good trade-off between the MKMMD loss and the multilabel classification loss can enhance the feature transferability, and the best performance of MLANet is achieved when λ = 1.0.

(5) The results from this study demonstrate that our MLANet can make ADASs and AVs adaptable to all-around-the-clock illumination in various weather conditions, and promote the development efficiency of ADASs and AVs.

(6) Our future work will focus on the MLIC of other objects and on adaptive object detection and tracking.
References
H Gao, B Cheng, J Wang, et al. Object classification using CNN-based fusion of vision and LIDAR in autonomous vehicle environment. IEEE Transactions on Industrial Informatics, 2018, 16(9): 4224-4231.
G Li, Y Yang, X Qu, et al. A deep learning based image enhancement approach for autonomous driving at night. Knowledge-Based Systems, 2021, 213: 106617.
G Li, Y Yang, T Zhang, et al. Risk assessment based collision avoidance decision-making for autonomous vehicles in multi-scenarios. Transportation Research Part C: Emerging Technologies, 2021, 122: 102820.
G Li, S Li, S Li, et al. Deep reinforcement learning enabled decision-making for autonomous driving at intersections. Automotive Innovation, 2020, 3: 374-385.
Y Chen, W Li, C Sakaridis, et al. Domain adaptive faster R-CNN for object detection in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 3339-3348.
G Li, S E Li, R Zou, et al. Detection of road traffic participants using cost-effective arrayed ultrasonic sensors in low-speed traffic situations. Mechanical Systems and Signal Processing, 2019, 132: 535-545.
S E Li, G Li, J Yu, et al. Kalman filter-based tracking of moving objects using linear ultrasonic sensor array for road vehicles. Mechanical Systems and Signal Processing, 2018, 98: 173-189.
X Zhang, Z Chen, Q M J Wu, et al. Fast semantic segmentation for scene perception. IEEE Transactions on Industrial Informatics, 2018, 15(2): 1183-1192.
L Mou, P Ghamisi, X X Zhu. Deep recurrent neural networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 2017, 55(7): 3639-3655.
Z Yan, W Liu, S Wen, et al. Multilabel image classification by feature attention network. IEEE Access, 2019, 7: 98005-98013.
M L Zhang, Z H Zhou. ML-KNN: A lazy learning approach to multilabel learning. Pattern Recognition, 2007, 40(7): 2038-2048.
A Elisseeff, J Weston. A kernel method for multilabelled classification. Neural Information Processing Systems, 2001, 14: 681-687.
M L Zhang, Z H Zhou. A review on multilabel learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 2013, 26(8): 1819-1837.
Y Wei, W Xia, M Lin, et al. HCP: A flexible CNN framework for multilabel image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38(9): 1901-1907.
J Wang, Y Yang, J Mao, et al. CNN-RNN: A unified framework for multilabel image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 2285-2294.
J Zhang, Q Wu, C Shen, et al. Multilabel image classification with regional latent semantic dependencies. IEEE Transactions on Multimedia, 2018, 20(10): 2801-2813.
L Song, J Liu, B Qian, et al. A deep multimodal CNN for multi-instance multilabel image classification. IEEE Transactions on Image Processing, 2018, 27(12): 6025-6038.
F C Heilbron, J C Niebles. Collecting and annotating human activities in web videos. Proceedings of International Conference on Multimedia Retrieval, 2014: 377-384.
G Li, Y Chen, D Cao, et al. Extraction of descriptive driving patterns from driving data using unsupervised algorithms. Mechanical Systems and Signal Processing, 2021, 156: 107589.
S J Pan, Q Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 2009, 22(10): 1345-1359.
S J Pan, I W Tsang, J T Kwok, et al. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 2010, 22(2): 199-210.
M Long, J Wang, G Ding, et al. Transfer feature learning with joint distribution adaptation. Proceedings of the IEEE International Conference on Computer Vision, 2013: 2200-2207.
J Wang, Y Chen, S Hao, et al. Balanced distribution adaptation for transfer learning. IEEE International Conference on Data Mining, 2017: 1129-1134.
J Wang, Y Chen, H Yu, et al. Easy transfer learning by exploiting intra-domain structures. IEEE International Conference on Multimedia and Expo, 2019: 1210-1215.
J Yosinski, J Clune, Y Bengio, et al. How transferable are features in deep neural networks? arXiv preprint, 2014: arXiv:1411.1792.
N Tajbakhsh, J Y Shin, S R Gurudu, et al. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Transactions on Medical Imaging, 2016, 35(5): 1299-1312.
X Zhang, F X Yu, S F Chang, et al. Deep transfer network: Unsupervised domain adaptation. arXiv preprint, 2015: arXiv:1503.00591.
E Tzeng, J Hoffman, N Zhang, et al. Deep domain confusion: Maximizing for domain invariance. arXiv preprint, 2014: arXiv:1412.3474.
A Gretton, K M Borgwardt, M J Rasch, et al. A kernel two-sample test. The Journal of Machine Learning Research, 2012, 13(1): 723-773.
C Sakaridis, D Dai, L Van Gool. Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision, 2018, 126(9): 973-992.
G Li, Y Wang, F Zhu, et al. Drivers' visual scanning behavior at signalized and unsignalized intersections: A naturalistic driving study in China. Journal of Safety Research, 2019, 71: 219-229.
G Li, W Lai, X Sui, et al. Influence of traffic congestion on driver behavior in post-congestion driving. Accident Analysis and Prevention, 2020, 141: 105508.
G Li, S E Li, B Cheng, et al. Estimation of driving style in naturalistic highway traffic using maneuver transition probabilities. Transportation Research Part C: Emerging Technologies, 2017, 74: 113-125.
M Cordts, M Omran, S Ramos, et al. The Cityscapes dataset for semantic urban scene understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 3213-3223.
A Geiger, P Lenz, R Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. IEEE Conference on Computer Vision and Pattern Recognition, 2012: 3354-3361.
G Li, Y Yang, X Qu. Deep learning approaches on pedestrian detection in hazy weather. IEEE Transactions on Industrial Electronics, 2019, 67(10): 8889-8899.
Q Wen, Z Luo, R Chen, et al. Deep learning approaches on defect detection in high resolution aerial images of insulators. Sensors, 2021, 21: 1033.
Z Wojna, V Ferrari, S Guadarrama, et al. The devil is in the decoder. British Machine Vision Conference, 2017: 1-13.
M Long, Y Cao, J Wang, et al. Learning transferable features with deep adaptation networks. International Conference on Machine Learning, 2015: 97-105.
G Li, Y Lin, X Qu. An infrared and visible image fusion method based on multi-scale transformation and norm optimization. Information Fusion, 2021: https://doi.org/10.1016/j.inffus.2021.02.008
G Li, H Xie, X Qu, et al. Detection of road objects with small appearance in images for autonomous driving in various traffic situations using a deep learning based approach. IEEE Access, 2020, 8: 211146-211172.
G Li, Y Liao, Q Guo, et al. Traffic crash characteristics in Shenzhen, China from 2014 to 2016. International Journal of Environmental Research and Public Health, 2021, 18: 1176.
Acknowledgements
Not applicable.
Funding
Supported by Shenzhen Fundamental Research Fund of China (Grant No. JCYJ20190808142613246), National Natural Science Foundation of China (Grant No. 51805332), and Young Elite Scientists Sponsorship Program funded by the China Society of Automotive Engineers.
Author information
Contributions
GL was in charge of the whole trial; ZJ and GL wrote the original manuscript; XQ reviewed and edited the original manuscript; XQ and DC assisted with the conceptualization and supervised this research work; ZJ, YC, and SL cooperated on the method development, experiment validation, and formal analysis. All authors read and approved the final manuscript.
Authors' Information
Guofa Li, born in 1986, is currently an associate research professor at the Institute of Human Factors and Ergonomics, College of Mechatronics and Control Engineering, Shenzhen University, China, and also a visiting scholar at the Department of Mechanical and Mechatronics Engineering, University of Waterloo, Canada. He received his Ph.D. degree from Tsinghua University, China, in 2016. His research interests include environment perception and decision making for autonomous vehicles.
Zefeng Ji, born in 1996, is currently a master's candidate at the Institute of Human Factors and Ergonomics, College of Mechatronics and Control Engineering, Shenzhen University, China.
Yunlong Chang, born in 1995, is currently an engineer at the Interactive Entertainment Group, Tencent, China. He received his master's degree from the University of Liverpool, UK, in 2019.
Shen Li, born in 1989, is currently a postdoc at the Traffic Operations and Safety Laboratory, University of Wisconsin – Madison, USA. He received his Ph.D. degree from the University of Wisconsin – Madison, USA, in 2019. His research interests include cooperative control methods for autonomous connected vehicles in intelligent transportation systems.
Xingda Qu, born in 1978, is currently a professor at the Institute of Human Factors and Ergonomics, College of Mechatronics and Control Engineering, Shenzhen University, China. He received his Ph.D. degree from Virginia Tech, USA, in 2008. His research interests include transportation safety, occupational safety and health, and human-computer interaction.
Dongpu Cao, born in 1978, is currently a professor at the Department of Mechanical and Mechatronics Engineering, University of Waterloo, Canada. He received his Ph.D. degree from Concordia University, Canada, in 2008. His research interests include vehicle dynamics, control, and intelligence.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, G., Ji, Z., Chang, Y. et al. MLANet: A Transfer Learning Approach Using Adaptation Network for Multilabel Image Classification in Autonomous Driving. Chin. J. Mech. Eng. 34, 78 (2021). https://doi.org/10.1186/s10033-021-00598-9
DOI: https://doi.org/10.1186/s10033-021-00598-9