ML-ANet: A Transfer Learning Approach Using Adaptation Network for Multi-label Image Classification in Autonomous Driving

To reduce the discrepancy between the source and target domains, a new multi-label adaptation network (ML-ANet) based on multiple kernel variants of maximum mean discrepancies is proposed in this paper. The hidden representations of the task-specific layers in ML-ANet are embedded in a reproducing kernel Hilbert space (RKHS), in which the mean embeddings of specific features from different domains can be precisely matched. Multiple kernel functions are used to improve feature distribution efficiency for explicit mean embedding matching, which can further reduce the domain discrepancy. Adverse weather and cross-camera adaptation experiments were conducted to verify the effectiveness of our proposed ML-ANet. The results show that our proposed ML-ANet achieves higher accuracies than the compared state-of-the-art methods for multi-label image classification in both the adverse weather adaptation and cross-camera adaptation experiments. These results indicate that ML-ANet can alleviate the reliance on fully labeled training data and improve the accuracy of multi-label image classification in various domain shift scenarios.


Introduction
Benefitting from the rapid development of deep learning technologies in recent years, applications based on convolutional neural networks (CNNs) have been extensively developed for advanced driver assistance systems (ADASs) and autonomous vehicles (AVs) [1][2][3][4]. These applications mainly focus on object detection [5,6], object tracking [7], and semantic segmentation [8]. Among them, image classification is the fundamental technology that divides images into different classes for better detection, tracking, and semantic segmentation performance.

Image Classification
The methods for image classification can generally be divided into two categories: single-label image classification (SLIC) [9] and multi-label image classification (MLIC) [10]. SLIC assumes that there is only one category of objects in each image. However, naturalistically collected images often contain multiple categories of objects in the real world (e.g., vehicles, cyclists, and pedestrians in a single image). Therefore, MLIC is needed for the safety enhancement of ADASs and AVs, and it has attracted increasing attention in real applications.
Early MLIC algorithms mainly include multi-label k-nearest neighbors (ML-KNN) [11], rank support vector machine (rank-SVM) [12], and multi-label decision tree (ML-DT) [13]. ML-KNN [11] uses maximum a posteriori (MAP) estimation to determine the set of labels for test samples based on the traditional KNN algorithm. Rank-SVM [12] uses a ranking loss function and the corresponding marginal function as constraints for multi-label learning based on SVM. ML-DT [13] uses the information gain criterion based on multi-label entropy to construct decision trees recursively. These traditional algorithms were extensively applied to MLIC tasks for object detection from the late 1990s to the early 2000s. However, because these methods have high computational complexity and low accuracy in predicting rare categories when the sample numbers are imbalanced, their classification accuracy was generally unsatisfactory.
More recently, deep learning technologies have been applied to MLIC tasks, and significant classification improvements have been achieved. The powerful nonlinear representation capabilities of deep neural networks allow more effective features to be learned from large-scale datasets for better performance. A hypotheses-CNN-pooling (HCP) algorithm was proposed in Ref. [14] based on binarized normed gradients (BING), which used cross-hypothesis max-pooling to fuse the classification results of all candidate regions to obtain images with complete label information. The results showed that better performance was achieved by fusing all candidate regions. Wang et al. [15] used a CNN to extract features of the input images and used recurrent neural networks (RNNs) to model the label dependency. The results showed that the combined CNN-RNN method could effectively identify image classes by modeling the label co-occurrence dependency in a joint image/label embedding space. Besides, Zhang et al. [16] employed CNNs to predict small object classes in images, and Song et al. [17] used a deep multi-modal CNN for multi-instance multi-label image classification. The results of both studies supported the effectiveness of their algorithms in the examined MLIC tasks.
An effective deep neural network model requires a large amount of accurately labeled samples for training, because millions or even billions of parameters need to be learned from the labeled samples. However, reliable labels heavily rely on extensive human labor [18,19]. This time-consuming and labor-intensive labeling work has hindered the rapid and widespread use of these technologies in practical applications. To alleviate this problem, transfer learning was developed as a solution [20].

Transfer Learning
Transfer learning is an effective technique to improve the performance of classifiers in the target domain when annotated data are available only in the source domain. In this context, transfer learning refers to unsupervised domain adaptation, which adapts features from labeled source domains to unlabeled target domains and thereby greatly reduces the cost of human labeling [20].
Pan et al. [21] proposed a transfer component analysis (TCA) algorithm. In the subspace of transfer components, the feature distribution discrepancy between the source domain and the target domain was significantly reduced while the separability of the data was retained. Long et al. [22] simultaneously adapted both the marginal and conditional distributions in a principled dimensionality reduction procedure by using joint distribution adaptation (JDA). Their results demonstrated the superiority of JDA in accuracy and efficiency over the compared deep learning and transfer learning methods in classification tasks. Wang et al. [23] developed balanced distribution adaptation (BDA), which adds a balance factor to dynamically weigh the importance of the marginal and conditional distributions to improve the classification accuracy. In Ref. [24], an easy transfer learning (EasyTL) approach was proposed to learn nonparametric transfer features by exploiting intra-domain structures to obtain an image classifier. It was concluded that EasyTL has high computational efficiency and can be directly applied to image classification on resource-constrained devices such as wearables.
To address the time-consuming and labor-intensive limitations of deep learning algorithms in MLIC applications, transfer learning has been introduced to improve the CNN training process and the accuracy in MLIC tasks. Yosinski et al. [25] quantitatively analyzed the features from the CNN encoding process and found that the encoded features were more effective in transfer learning. Based on their experimental findings, Tajbakhsh et al. [26] concluded that when training a deep CNN model in the target domain, it is better to fine-tune a CNN model pre-trained in the source domain than to retrain the model in the target domain. Zhang et al. [27] proposed a deep transfer network (DTN) framework that used deep neural networks for cross-domain feature distribution matching. The effectiveness of the algorithm was validated in cross-domain multi-class object recognition tasks. Tzeng et al. [28] analyzed the domain confusion loss and proposed a deep domain confusion (DDC) algorithm to optimize an objective function that maximizes the consistency between the source domain and the target domain. The experimental results showed that the learned representations were invariant to domain shifts and thus could be used for MLIC tasks.

Contributions
Although the above-mentioned MLIC algorithms have achieved significant progress, image classification in complex traffic environments (e.g., hazy or snowy weather) based on camera systems is still a challenging task for the development of ADASs and AVs, because the generalization capability of the algorithms in real traffic still needs to be improved and the algorithms are prone to failure in cross-domain adaptation. To solve this problem and meet the high-accuracy requirement of practical applications, we propose an effective deep adaptive neural network method for MLIC tasks, namely, the multi-label adaptation network (ML-ANet). Specifically, ML-ANet leverages transfer learning to transfer knowledge from a well-labeled domain to a similar but different domain with limited or no labels. To effectively use the labeled data in the source domain, we conducted MLIC supervised learning on the source domain data and used multiple kernel variants of maximum mean discrepancies to match the distributions of the feature maps of the source and target domains to reduce the domain discrepancy. The main contributions of this paper can be summarized as follows.
(1) We proposed a new deep adaptation network, ML-ANet, to learn transferable features for adapting models from a source domain (with labeled information) to a different target domain (without labeled information) in MLIC tasks.
(2) The effectiveness of our proposed ML-ANet in various traffic environments has been demonstrated by extensive experiments on three large-scale driving datasets. This suggests that, when applied in ADASs and AVs, our proposed ML-ANet could make them adaptable to around-the-clock illuminations in various weather conditions.
(3) Our proposed ML-ANet alleviates the reliance on fully labeled training data, so no extensive labeling work is needed for network development. This would improve the development efficiency of ADASs and AVs.

Proposed Approach
The proposed MLIC approach (i.e., ML-ANet) mainly consists of two sub-networks: the multi-label learning network (ML-Net) and the adaptation network (A-Net). ML-Net uses labeled samples from the source domain to train a multi-label classifier for the simultaneous prediction of the multiple labels of an image. A-Net embeds the features from the task-specific layer into a reproducing kernel Hilbert space (RKHS) and optimally matches different distributions using the multi-kernel maximum mean discrepancy (MK-MMD) in RKHS. See Figure 1 for the overall framework of our proposed ML-ANet. A detailed description of the proposed method is given in the following subsections.

ML-Net (Multi-label Learning Network)
Multi-label learning means that each image is associated with multiple class labels simultaneously. Assume that the training set can be described as D = {(x_i, y_i)}, i = 1, ..., N, where x_i denotes the ith image and y_i denotes its label. In this paper, we considered the MLIC problem as multiple binary classification problems, which means that, for each label, the samples carrying that label were considered as positive samples (i.e., y_i = 1), while the others were considered as negative samples (i.e., y_i = 0).
ML-Net trains multi-label classifiers based on labeled data samples from the source domain. Specifically, an image with a size of 224×224×3 was fed into the ML-Net, and a feature map was extracted through ResNet-50. As shown in Figure 1, the dimension of the feature vector was reduced from 4096 to 2048 by the first fully connected layer (FCL1), from 2048 to 256 by FCL2, and from 256 to the number of examined labels by FCL3. The number of parameters in ML-Net is approximately 24 million, which makes it difficult to learn all parameters directly from the source domain. Therefore, we transferred the ResNet-50 model pre-trained on the ImageNet dataset to the ML-Net, trained the three fully connected layers, and fine-tuned the other layers. Finally, we used the sigmoid function to calculate the score for each category and used the binary cross-entropy loss as the multi-label classification loss function.
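The data flow through the three fully connected layers can be sketched in a few lines of NumPy. This is a minimal illustration only: the class name `MLNetHead`, the random initialization, and the ReLU activations between FCL1 and FCL2 are our assumptions, not details confirmed by the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MLNetHead:
    """Sketch of the three fully connected layers (FCL1-FCL3) of ML-Net.

    Maps a 4096-d backbone feature vector to per-class probabilities:
    4096 -> 2048 -> 256 -> num_labels, with a per-class sigmoid at the end.
    """
    def __init__(self, num_labels, seed=0):
        rng = np.random.default_rng(seed)
        # He-style random initialization (illustrative only).
        self.W1 = rng.normal(0.0, np.sqrt(2 / 4096), (4096, 2048))
        self.W2 = rng.normal(0.0, np.sqrt(2 / 2048), (2048, 256))
        self.W3 = rng.normal(0.0, np.sqrt(2 / 256), (256, num_labels))

    def forward(self, x):
        h = np.maximum(x @ self.W1, 0.0)   # FCL1 (+ assumed ReLU)
        h = np.maximum(h @ self.W2, 0.0)   # FCL2 -- the layer adapted by A-Net
        return sigmoid(h @ self.W3)        # FCL3 + per-class sigmoid

head = MLNetHead(num_labels=3)
probs = head.forward(np.random.default_rng(1).normal(size=(2, 4096)))
# probs: one independent probability per class, unlike softmax in SLIC.
```

Because each output passes through its own sigmoid, several labels can be active at once, which is exactly what distinguishes MLIC from SLIC.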
For each minibatch, we calculated the loss using the binary cross-entropy formula:

J(θ) = −(1/N) Σ_{i=1}^{N} [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)], with ŷ_i = h_θ(x_i),

where N is the number of training samples, h_θ(x_i) denotes the probability of the ith class calculated by the sigmoid function, ŷ_i denotes the value predicted by ML-Net for the ith class, y_i is the ground truth of the ith class, and y_i ∈ {0, 1}.
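The binary cross-entropy loss can be written directly in NumPy. This is a minimal sketch; the function name and the numerical clipping (to avoid log(0)) are our additions.

```python
import numpy as np

def multilabel_bce(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy averaged over samples and labels:

    J(theta) = -(1/N) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # guard against log(0)
    per_term = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return -np.mean(per_term)

y_true = np.array([[1.0, 0.0, 1.0]])   # ground-truth multi-label vector
y_pred = np.array([[0.9, 0.1, 0.8]])   # sigmoid outputs of ML-Net
loss = multilabel_bce(y_true, y_pred)
```

A perfect prediction drives the loss toward zero, while confident wrong predictions are penalized heavily.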

A-Net (Adaptation Network)
In deep neural networks, the shallow layers learn general features, so their parameters are universal across different tasks, while the parameters of the deep layers depend on the specific task [25]. This inspired us to focus our proposed network on the deep task-specific layers. Therefore, we propose an adaptation network (A-Net) to explore the transferability from a domain with labeled information to another domain without labeled information by embedding the MK-MMD loss in the last layers. In transfer learning, the domain with labeled information is treated as the source domain, and the domain without labeled information is treated as the target domain. The data from these two domains usually follow different probability distributions.
In this paper, to align the different data distributions of the two domains, we introduced an RKHS in which the domain discrepancy is measured using the multiple kernel variants of MMD proposed by Gretton et al. [29]. Specifically, A-Net learns transferable features by using MK-MMD to embed the deep features of FCL2 into the RKHS, which can optimally match the source and target domain distributions. Figure 2 gives an intuitive example of domain adaptation using multiple Gaussian kernels. For biased datasets (left), a classifier learned from a source domain cannot transfer well to a target domain. By mapping the samples from the source domain and the target domain to the RKHS (right), discriminative and domain-invariant representations can be learned.
Assuming that X_s = {x_s1, x_s2, ..., x_sn} consists of n samples with labeled information Y_s = {y_s1, y_s2, ..., y_sn} in the source domain, and X_t = {x_t1, x_t2, ..., x_tm} consists of m samples without labels in the target domain, the source domain and the target domain can be described as D_s = {(X_s, Y_s)} and D_t = {X_t}, respectively. The probability distributions of the source and target domains embedded in RKHS are denoted as p and q, respectively. The MK-MMD d_k(p, q) is defined as the distance between the mean embeddings of the probability distributions p and q in RKHS. Hence, the squared formula of MK-MMD can be described as follows:

d_k²(p, q) = ‖E_p[φ(x_s)] − E_q[φ(x_t)]‖²_{H_k},

where H_k is an RKHS with a characteristic kernel k, E_p[φ(x_s)] is the mean embedding of p, E_q[φ(x_t)] is the mean embedding of q, and φ(·) is a feature mapping function which maps the features from the original feature space to the RKHS. In MK-MMD, we denote K as a particular family of kernels:

K := {k = Σ_{u=1}^{U} β_u k_u : Σ_{u=1}^{U} β_u = 1, β_u ≥ 0, ∀u},

where {k_u} is the set of U positive definite kernel functions, and the constraints on the coefficients {β_u} are imposed to guarantee that the derived multi-kernel k is characteristic.
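The multi-kernel construction above can be illustrated with a convex combination of Gaussian kernels. In this sketch the bandwidths γ_u and the uniform weights β_u are assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np

def multi_kernel(X, Y, gammas=(0.5, 1.0, 2.0), betas=None):
    """Combined kernel k = sum_u beta_u * k_u built from Gaussian kernels,

    with k_u(x, y) = exp(-||x - y||^2 / gamma_u). The betas must be
    non-negative and sum to one, matching the constraint on {beta_u}.
    Returns the (len(X), len(Y)) Gram matrix of the combined kernel.
    """
    if betas is None:
        betas = np.full(len(gammas), 1.0 / len(gammas))  # uniform weights
    assert np.all(np.asarray(betas) >= 0) and np.isclose(np.sum(betas), 1.0)
    # Pairwise squared Euclidean distances between rows of X and Y.
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return sum(b * np.exp(-sq / g) for b, g in zip(betas, gammas))

X = np.random.default_rng(0).normal(size=(4, 2))
K = multi_kernel(X, X)
# On the diagonal, k(x, x) = sum_u beta_u = 1 for any valid weights.
```

Because each k_u is a valid Gaussian kernel and the weights form a convex combination, the combined k remains a positive definite kernel.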
Given x_s and x_s′ as independent random variables with distribution p, and x_t and x_t′ as independent random variables with distribution q, the characteristic kernel k(·, ·) is defined as k(x_s, x_t) = ⟨φ(x_s), φ(x_t)⟩. Hence, the squared distance can be expanded as:

d_k²(p, q) = E[k(x_s, x_s′)] + E[k(x_t, x_t′)] − 2E[k(x_s, x_t)],

where x_s′ is an independent copy of x_s with the same distribution p, and x_t′ is an independent copy of x_t with the same distribution q. The purpose of A-Net is to minimize the domain discrepancy between the source and target domains, which can be measured by the distance between the mean embeddings of the probability distributions from the two domains. Therefore, we have:

min D_F(D_s, D_t) = d_k²(p, q),

where D_s and D_t denote the source and target domains, respectively, and D_F(·) denotes the domain discrepancy between the source and target domains measured in the last fully connected layer of A-Net.
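The expanded form of d_k² can be estimated empirically from finite feature batches. The sketch below uses a single Gaussian kernel for brevity (the bandwidth value is our assumption); this is the plain biased estimator, not the linear-time version discussed later.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    """Gram matrix of k(x, y) = exp(-||x - y||^2 / gamma)."""
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / gamma)

def mmd2(Xs, Xt, gamma=1.0):
    """Empirical squared MMD from the three kernel expectations:

    d^2 = E[k(xs, xs')] + E[k(xt, xt')] - 2 E[k(xs, xt)]
    """
    return (gaussian_kernel(Xs, Xs, gamma).mean()
            + gaussian_kernel(Xt, Xt, gamma).mean()
            - 2.0 * gaussian_kernel(Xs, Xt, gamma).mean())

rng = np.random.default_rng(0)
same = rng.normal(size=(50, 3))
shifted = same + 2.0   # a clear mean shift between the two "domains"
# Identical samples give d^2 = 0; a distribution shift gives a larger value.
```

When the two feature batches coincide, the three expectations cancel exactly; a domain shift makes the cross-term shrink and d² grow, which is what A-Net penalizes.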

ML-ANet
The loss of ML-ANet consists of the MK-MMD loss and the multi-label classification loss. The objective of minimizing the multi-label classification loss is to improve the distinguishability of features in the source domain, while the goal of minimizing the MK-MMD loss is to reduce the discrepancy between the means of the probability distributions in the source and target domains. Thus, we derived the loss function of ML-ANet as:

L = J(θ) + λ D_F(D_s, D_t), (7)

where λ is the loss weight parameter and a hyper-parameter, D_F denotes the MK-MMD loss, J(θ) represents the multi-label classification loss of the source domain, and L is the total loss of the whole ML-ANet, which is trained by the mini-batch stochastic gradient descent (SGD) algorithm to minimize the loss on the training samples.
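The weighting of the two loss terms is a simple linear combination; the sketch below only illustrates it (the function name is ours, and the numeric values are placeholders).

```python
def ml_anet_loss(cls_loss, mkmmd_loss, lam=1.0):
    """Total ML-ANet loss: L = J(theta) + lambda * D_F(Ds, Dt).

    lam trades classification accuracy in the source domain against
    transferability to the target domain; the paper's sensitivity
    analysis reports lambda = 1.0 as the best trade-off.
    """
    return cls_loss + lam * mkmmd_loss

# Placeholder values for one mini-batch:
total = ml_anet_loss(cls_loss=0.25, mkmmd_loss=0.10, lam=1.0)
```

Setting lam = 0 recovers pure source-domain supervised training, while a very large lam lets the transfer term dominate and degrades classification, matching the bell-shaped curves discussed later.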
Mini-batch SGD is important for the training of deep networks, but the calculation of pairwise similarities leads to a computational complexity of O(n²). To solve this problem, we used the unbiased empirical estimate of MK-MMD proposed by Gretton et al. [29], which can be computed with a complexity of O(n). The unbiased empirical estimate of the squared MK-MMD is:

d_k²(p, q) = (2/n_s) Σ_{i=1}^{n_s/2} g_k(z_i), (8)

g_k(z_i) = k(x_{s,2i−1}, x_{s,2i}) + k(x_{t,2i−1}, x_{t,2i}) − k(x_{s,2i−1}, x_{t,2i}) − k(x_{s,2i}, x_{t,2i−1}),

where n_s denotes the number of source samples and z_i = (x_{s,2i−1}, x_{s,2i}, x_{t,2i−1}, x_{t,2i}) is the quad-tuple drawn from the two domains. When we train a deep CNN by mini-batch SGD, we only need to consider the gradient of Eq. (8) with respect to each data point x_i. To perform a mini-batch update, we computed the gradient of Eq. (7) with respect to the lth layer parameters θ^l as:

∇θ^l = ∂J(z_i)/∂θ^l + λ ∂g_k(z_i^l)/∂θ^l.

Given that the kernel k is a linear combination of multiple Gaussian kernels {k_u(x, x′) = exp(−‖x − x′‖²/γ_u)}, the gradient ∂g_k(z_i^l)/∂θ^l can be easily calculated by using the chain rule. For instance, the gradient of k(x_{s,2i−1}^l, x_{t,2i}^l) in g_k(z_i^l) can be calculated as:

∂k(x_{s,2i−1}^l, x_{t,2i}^l)/∂θ^l = −(2/γ) k(x_{s,2i−1}^l, x_{t,2i}^l)(x_{s,2i−1}^l − x_{t,2i}^l)^T (∂x_{s,2i−1}^l/∂θ^l − ∂x_{t,2i}^l/∂θ^l),

where x_i^l = W^l x_i^{l−1} + b^l, and W^l and b^l represent the coefficient matrix and the bias term from the (l−1)th layer to the lth layer, respectively. In summary, the training process of the entire ML-ANet approach is described in Algorithm 1.
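The quad-tuple estimator can be sketched as follows in plain NumPy. The Gaussian bandwidths and the equal kernel weights are assumptions for illustration; note that the loop touches each sample once, which is what makes the estimate O(n) rather than O(n²).

```python
import numpy as np

def gaussian_kernel_pair(a, b, gamma=1.0):
    """k(a, b) = exp(-||a - b||^2 / gamma) for two single samples."""
    return np.exp(-np.sum((a - b) ** 2) / gamma)

def mkmmd_linear(Xs, Xt, gammas=(0.5, 1.0, 2.0)):
    """Linear-time unbiased MK-MMD estimate over quad-tuples:

    d^2 ~= (2/n) * sum_i g(z_i), with
    g(z_i) = k(xs_{2i-1}, xs_{2i}) + k(xt_{2i-1}, xt_{2i})
           - k(xs_{2i-1}, xt_{2i}) - k(xs_{2i}, xt_{2i-1}).
    """
    n = min(len(Xs), len(Xt)) // 2 * 2   # use an even number of samples

    def k(a, b):  # equal-weight multi-kernel (illustrative weights)
        return sum(gaussian_kernel_pair(a, b, g) for g in gammas) / len(gammas)

    total = 0.0
    for i in range(0, n, 2):             # one quad-tuple per pair of indices
        s1, s2, t1, t2 = Xs[i], Xs[i + 1], Xt[i], Xt[i + 1]
        total += k(s1, s2) + k(t1, t2) - k(s1, t2) - k(s2, t1)
    return 2.0 * total / n

rng = np.random.default_rng(0)
src = rng.normal(size=(64, 4))
tgt = src.copy()     # identical "domains" -> estimate is exactly zero
far = src + 5.0      # strongly shifted "domain" -> positive estimate
```

Each quad-tuple contributes two within-domain similarities minus two cross-domain similarities, so the estimate vanishes when the domains coincide and grows with the shift.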

Algorithm 1
The training process of ML-ANet.

Datasets and Experiment
Two adaptation experiments were conducted to verify the effectiveness of our proposed ML-ANet, i.e., adverse weather adaptation and cross-camera adaptation. Whether a detection system can operate reliably in different weather conditions is essential for a safe autonomous driving system [30]. This paper mainly addressed the domain shift caused by the conversion between clear weather and hazy weather in the adverse weather adaptation experiment. In the cross-camera adaptation experiment, we examined the effects on alleviating the data bias caused by the different resolutions and contrasts of color cameras under similar weather conditions. The datasets used in our experiments are described in Section 3.1, and the experimental setup is described in Section 3.2.

Datasets
Naturalistic driving datasets are very important for the development of autonomous driving technologies [31][32][33]. Three naturalistic driving datasets were used to train and evaluate our proposed ML-ANet: Cityscapes [34], Foggy Cityscapes [30], and KITTI [35]. The Cityscapes dataset is an urban scene dataset for driving scenarios. The Foggy Cityscapes dataset is a synthetic foggy dataset derived from Cityscapes for semantic foggy scene understanding analysis. The KITTI dataset consists of images collected during real driving in mid-size cities. Although these three datasets cover various urban scenes, the images vary in style, resolution, and illumination between datasets. The main domain divergence between Foggy Cityscapes and Cityscapes is the synthetic fog effect in Foggy Cityscapes, while KITTI has obvious changes in image resolution, illumination, and urban scenes compared with Cityscapes. Figure 3 illustrates the visual differences between Cityscapes, Foggy Cityscapes, and KITTI. In this paper, we use C to represent Cityscapes, F to represent Foggy Cityscapes, and K to represent KITTI. Therefore, the transfer from Cityscapes to Foggy Cityscapes can be denoted as C→F, and similar expressions can be obtained for the other transfer patterns.
In our experiments, we conducted four transfer tasks including C→F, F→C, C→K, and K→C.

Experiment
The experimental training data consist of source training data with images and their category annotations, and target training data with only images. We extracted three classes (i.e., pedestrians, vehicles, and two-wheelers) from the three datasets for the experiments. Table 1 lists the number of sample images for each class in the three employed datasets. Because the number of images in the datasets was insufficient for reliable training, we randomly applied five image augmentation techniques to expand the dataset: rotation, shift, contrast adjustment, scaling, and horizontal flipping [36,37]. All images in the experiment were resized to 224 × 224 × 3, and we initialized the model with weights pretrained on ImageNet. Each batch included 32 images from each of the source and target domains. We used an optimizer with a momentum of 0.9 and a weight decay of 0.001 in the experiment [38]. Table 2 lists the hyper-parameters used for the training of our ML-ANet. All the experiments were run on an Intel i5-9600KF CPU (3.70 GHz) with an NVIDIA GeForce RTX 2070 GPU.
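The augmentation step can be roughly illustrated with a minimal NumPy sketch. This is our own illustration, not the paper's pipeline: scaling is omitted for brevity, the probabilities and parameter ranges are assumed, and a real implementation would use an image library such as torchvision or albumentations.

```python
import numpy as np

def augment(img, rng):
    """Randomly apply simplified versions of some of the augmentations
    mentioned in the text: horizontal flip, rotation, shift, and
    contrast jitter. `img` is an H x W x 3 float array in [0, 255];
    every operation here keeps the shape fixed (rotation assumes a
    square image).
    """
    if rng.random() < 0.5:                         # horizontal flip
        img = img[:, ::-1, :]
    if rng.random() < 0.5:                         # 90-degree rotation
        img = np.rot90(img)
    if rng.random() < 0.5:                         # horizontal shift
        img = np.roll(img, shift=int(rng.integers(-8, 9)), axis=1)
    if rng.random() < 0.5:                         # contrast jitter
        f = rng.uniform(0.8, 1.2)
        img = np.clip((img - img.mean()) * f + img.mean(), 0.0, 255.0)
    return img

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(224, 224, 3)).astype(np.float64)
out = augment(img, rng)
```

Such label-preserving transforms multiply the effective number of training samples without any additional annotation effort, which is the purpose of this step in the experiment.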

Table 2. Hyper-parameters for the training of ML-ANet (excerpt): activation function, ReLU; optimizer, SGD.

Adverse Weather Adaptation
We compared our proposed ML-ANet with state-of-the-art transfer learning methods to verify the effectiveness of our method in reducing the divergence between the source and target domains. The quantitative results of the adverse weather adaptation experiment are listed in Table 3. The presented results show that our ML-ANet achieves average accuracies of 94.83% and 96.85% in the transfer tasks C→F and F→C, respectively, better than the compared transfer learning methods. The classification performance on each object class is also superior to that of the other compared methods. These results indicate that our proposed ML-ANet can effectively transfer knowledge from a clearly labeled domain to a similar but different domain with limited or no labels. The advantage of our ML-ANet probably arises because it can effectively reduce, through the adaptation network, the distribution discrepancy between the source and target domains caused by weather changes.

To better quantify the performance of our proposed ML-ANet, the average accuracies of TCA, JDA, BDA, DDC, DAN, and our ML-ANet for the transfer tasks C→F (from Cityscapes to Foggy Cityscapes) and F→C (from Foggy Cityscapes to Cityscapes) with respect to the number of epochs are shown in Figure 5(a) and Figure 5(b), respectively. The illustrated results show that: a) TCA has the lowest accuracy because it adapts only the marginal distribution and involves no iterative training. b) The convergence speed and accuracy of ML-ANet are substantially higher than those of DDC, indicating that single-kernel MMD cannot sufficiently align the probability distributions of the source and target domains. c) The ML-ANet curve stays above that of DAN, which demonstrates that ML-ANet has better MLIC performance and can further reduce the divergence between the source and target domains. d) ML-ANet can achieve a promising accuracy within a small number of epochs.

Cross-Camera Adaptation
The camera mechanisms and underlying settings can also lead to domain shift, such as substantial differences in visual appearance and image quality. Therefore, cross-camera adaptation is also an important and effective indicator for measuring the quality of transfer learning. Intuitive examples of ML-ANet for MLIC in cross-camera adaptation can be found in Figure 4(b) and Figure 4(c). The quantitative experimental results of cross-camera adaptation are shown in Table 4. Specifically, the average MLIC accuracies of ML-ANet are 78.67% and 70.10% in the transfer tasks C→K and K→C, respectively. These numbers are 2.2% and 1.03% higher than those of the best compared method. Besides, the pedestrian and vehicle classification accuracies of ML-ANet in task K→C are substantially lower than those in task C→K, which is reasonable because the number of K samples, especially vehicles, is relatively small compared to C. However, the classification performance of our ML-ANet on two-wheelers in K→C does not show a similar trend, and the classification accuracies of two-wheelers are lower than those of pedestrians and vehicles in both C→K and K→C. This is probably because two-wheeler riders always appear together with bikes or motorcycles, which increases the noise in feature learning, and because the adaptability of different transfer tasks differs. The classification accuracies of the examined methods with respect to the number of epochs in the transfer tasks C→K and K→C show similar trends to the results illustrated in Figure 5, indicating the superiority of ML-ANet in cross-camera adaptation. In summary, the results show that ML-ANet is effective against the domain shift caused by different cameras.
The loss weight parameter λ may also influence the performance of ML-ANet. Figure 6 shows the effect of λ on the performance of ML-ANet and DDC in the F→C (from Foggy Cityscapes to Cityscapes) and C→K (from Cityscapes to KITTI) tasks; the other three methods do not include the parameter λ, so only DDC and ML-ANet are compared in Figure 6. The experimental results show that the classification accuracy of ML-ANet clearly outperforms that of DDC for any λ in either task, supporting the advantage of our proposed ML-ANet. The curves of ML-ANet are bell-shaped, with an initial rise and a following decrease as λ increases. This trend is reasonable: when λ initially increases, the network focuses more on the MK-MMD loss, which improves transferability and increases the accuracy of ML-ANet; however, when λ is too large, the training ignores the classification loss, which causes the accuracy to decrease. The results illustrated in Figure 6 show that the best performance of ML-ANet is achieved at λ = 1.0, which is the best trade-off for transferability enhancement in ML-ANet.

Discussion on the Novelties of Our Proposed Method
Different from the previous methods, the novelties of our proposed method include: (1) The structure of our proposed network is different from the previous ones. Compared with other methods such as DAN, the combination of identity blocks (i.e., the Id_B module in Figure 2), convolutional blocks (i.e., the Conv_B module in Figure 2), and MK-MMD is newly developed to accelerate the convergence rate of the model and to improve the image classification accuracy and adaptation capability of the network. (2) MK-MMD is innovatively used to simultaneously align the data distributions of multiple labels to enhance the generalization of multi-label image classifiers for ADASs and AVs, which extends the previous work on MK-MMD from single-label to multi-label image classification. The results presented above show that our method outperforms the others, both qualitatively and quantitatively, on the different urban cross-scenes, which demonstrates that ML-ANet has a better adaptive ability to effectively alleviate the domain gap. Other reasons why our method can utilize unlabeled auxiliary data to improve the generalization of the network are: (1) the generated feature maps are mapped to RKHS, and (2) the distributions of data in the source and target domains are aligned during network training. Therefore, our well-trained model can effectively perform MLIC in both the source and target domains, which indicates that our ML-ANet has a strong generalization capability for multi-label image classification in different urban scenes.
This paper mainly focuses on moving objects, including pedestrians, vehicles, and two-wheelers. Static objects such as traffic signs and traffic lights are usually much smaller in images than the above-mentioned moving objects [40,41], which challenges the performance of the related methods in the literature. In our future studies, we will focus on developing innovative methods to classify the other objects in various scenarios for traffic safety improvement [42]. Meanwhile, adaptive object detection and tracking in various weather and illumination scenarios, based on knowledge transferred from daytime dry weather scenarios, will also be considered.

Conclusions
(1) To obtain a robust multi-label classifier, a novel and effective method (ML-ANet) is proposed for cross-domain MLIC. The proposed ML-ANet consists of two sub-networks, ML-Net and A-Net, for multi-label learning and transfer learning, respectively.
(2) In adverse weather adaptation, ML-ANet achieves average accuracies of 94.83% and 96.85% in the transfer tasks C→F and F→C, respectively. The accuracy of ML-ANet can even exceed that of the compared methods within a small number of epochs.
(3) In cross-camera adaptation, the average accuracies of ML-ANet are 2.2% and 1.03% higher than those of the best comparison method in the C→K and K→C transfer tasks, respectively, showing its better adaptability.
(4) The sensitivity analysis of the loss weight parameter λ shows that a good trade-off between the MK-MMD loss and the multi-label classification loss can enhance feature transferability, and the best performance of ML-ANet is achieved at λ = 1.0.
(5) The results from this study demonstrate that our ML-ANet can make ADASs and AVs adaptable to around-the-clock illuminations in various weather conditions and promote the development efficiencies of ADASs and AVs.
(6) Our future work will focus on the MLIC of other objects and on adaptive object detection and tracking.