The proposed MLIC approach (i.e., ML-ANet) mainly consists of two sub-networks: the multi-label learning network (ML-Net) and the adaptation network (A-Net). ML-Net uses labeled samples from the source domain to train a multi-label classifier that predicts multiple labels of an image simultaneously. A-Net embeds the features from the task-specific layer into a reproducing kernel Hilbert space (RKHS) and optimally matches the different distributions using the multi-kernel maximum mean discrepancy (MK-MMD) in the RKHS. See Figure 1 for the overall framework of the proposed ML-ANet. A detailed description of the proposed method is given in the following subsections.
2.1 ML-Net (Multi-label Learning Network)
Multi-label learning means that each image is associated with multiple class labels simultaneously. Assume that the training set images can be described as \(I = \{ x_{i} \}\), where \(x_{i}\) represents image i and its corresponding label vector is \(y_{i} \in \{ 0,1\}^{c}\). \(y_{i}^{j} = 1\) indicates that the jth label exists in image \(x_{i}\), while \(y_{i}^{j} = 0\) indicates the absence of the jth label in image \(x_{i}\). The MLIC task is essentially to learn a mapping function \(f:x \to y\) from the training set \(\{ (x_{i} ,y_{i} )|1 \le i \le n\}\). In this paper, we treated the MLIC problem as multiple binary classification problems: for each label, the samples containing that label were considered positive samples (i.e., \(y_{i}^{j} = 1\)), while the others were considered negative samples (i.e., \(y_{i}^{j} = 0\)).
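For concreteness, the following minimal Python sketch (our own illustration, not the authors' code; the class count and label indices are made up) shows how such multi-hot label vectors can be constructed:

```python
# Hypothetical illustration: each image's labels form a multi-hot vector
# y_i in {0, 1}^c, so the MLIC task decomposes into c binary problems.
import torch

num_classes = 5                                 # c, assumed for illustration
labels_per_image = [[0, 3], [2], [1, 3, 4]]     # class indices present in each image

targets = torch.zeros(len(labels_per_image), num_classes)
for i, present in enumerate(labels_per_image):
    targets[i, present] = 1.0                   # y_i^j = 1 if label j exists in image x_i

print(targets)
# tensor([[1., 0., 0., 1., 0.],
#         [0., 0., 1., 0., 0.],
#         [0., 1., 0., 1., 1.]])
```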
ML-Net trains multi-label classifiers based on labeled samples from the source domain. Specifically, an image of size 224×224×3 was fed into ML-Net, and a feature map was extracted by ResNet-50. As shown in Figure 1, the dimension of the feature vector was reduced from 4096 to 2048 by the first fully connected layer (FCL1), from 2048 to 256 by FCL2, and from 256 to the number of examined labels by FCL3. ML-Net has approximately 24 million parameters, which makes it difficult to learn them directly from the source domain alone. Therefore, we transferred a ResNet-50 model pre-trained on the ImageNet dataset to ML-Net; we trained the three fully connected layers and fine-tuned the other layers. Finally, we used the sigmoid function to calculate the score for each category and used the binary cross-entropy loss as the multi-label classification loss function. For each mini-batch, we calculated the loss using the following formulas:
$$ J(\theta ) = - \frac{1}{N}\sum\nolimits_{i = 1}^{N} {[y_{i} \log h_{\theta } (\hat{y}_{i} ) + (1 - y_{i} )\log (1 - h_{\theta } (\hat{y}_{i} ))]} , $$
(1)
$$ h_{\theta } (x_{i} ) = 1/[1 + \exp ( - \theta^{{\text{T}}} x_{i} )], $$
(2)
where N is the number of training samples, \(h_{\theta } (x_{i} )\) denotes the probability of the ith class calculated by the sigmoid function, \(\hat{y}_{i}\) denotes the value predicted by ML-Net for the ith class, \(y_{i}\) is the ground truth of the ith class, and \(y_{i} \in \{ 0,1\}\).
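The following PyTorch sketch illustrates one possible realization of ML-Net and its loss, under stated assumptions: the torchvision ResNet-50 backbone, ReLU activations between the fully connected layers, and a first fully connected layer whose input size is inferred automatically (the text's 4096-dimensional input to FCL1 depends on where the backbone feature map is flattened). It is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class MLNet(nn.Module):
    """Sketch of ML-Net: pre-trained ResNet-50 backbone + FCL1/FCL2/FCL3."""
    def __init__(self, num_labels: int):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")   # transferred pre-trained model
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop ResNet's own FC
        # The text reduces the feature dimension 4096 -> 2048 -> 256 -> num_labels;
        # LazyLinear lets FCL1 infer its input size from the flattened feature map.
        self.fcl1 = nn.LazyLinear(2048)
        self.fcl2 = nn.Linear(2048, 256)        # FCL2 output is what A-Net embeds into the RKHS
        self.fcl3 = nn.Linear(256, num_labels)

    def forward(self, x):
        f = torch.flatten(self.features(x), 1)
        h = torch.relu(self.fcl1(f))            # ReLU between FC layers is an assumption
        h = torch.relu(self.fcl2(h))
        return self.fcl3(h)                     # logits; sigmoid is applied inside the loss

model = MLNet(num_labels=4)
logits = model(torch.randn(2, 3, 224, 224))     # images of size 224x224x3
targets = torch.randint(0, 2, (2, 4)).float()   # multi-hot ground-truth labels
loss = nn.BCEWithLogitsLoss()(logits, targets)  # sigmoid + binary cross-entropy, Eqs. (1)-(2)
```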
2.2 A-Net (Adaptation Network)
In deep neural networks, the shallow layers learn general features, so their parameters are universal across different tasks, whereas the parameters of the deep layers depend on the specific task [25]. This inspired us to focus the proposed network on the deep task-specific layers. Therefore, we proposed an adaptation network (A-Net) that explores the transferability from a domain with labeled information to a domain without labeled information by embedding the MK-MMD loss in the last layers. In transfer learning, the domain with labeled information is treated as the source domain, and the domain without labeled information is considered the target domain. The data from these two domains usually follow different probability distributions.
In this paper, to align the data distributions of the two domains, we introduced an RKHS in which the domain discrepancy is measured using the multiple kernel variant of MMD proposed by Gretton et al. [29]. Specifically, A-Net learns transferable features by using MK-MMD to embed the deep features of FCL2 into the RKHS, which can optimally match the source and target domain distributions. Figure 2 gives an intuitive example of domain adaptation using multiple Gaussian kernels. For biased datasets (left), a classifier learned from the source domain cannot transfer well to the target domain. By mapping the samples from the source domain and the target domain into the RKHS (right), discriminative and domain-invariant representations can be learned.
Assume that \(X^{s} = \{ x_{1}^{s} ,x_{2}^{s} ,...,x_{n}^{s} \}\) consists of n samples with label information \(Y^{s} = \{ y_{1}^{s} ,y_{2}^{s} ,...,y_{n}^{s} \}\) in the source domain, and that \(X^{t} = \{ x_{1}^{t} ,x_{2}^{t} ,...,x_{m}^{t} \}\) consists of m samples without labels in the target domain. The source domain and the target domain can then be described as \(D_{s} = \{ (X^{s} ,Y^{s} )\}\) and \(D_{t} = \{ X^{t} \}\), respectively. The probability distributions of the source and target domains embedded in the RKHS are denoted as p and q, respectively. The MK-MMD \(d_{k} (p,q)\) is defined as the distance between the means of the probability distributions p and q in the RKHS. Hence, the squared form of MK-MMD can be described as follows:
$$ d_{k}^{2} (p,q) \triangleq \left\| {E_{p} [\phi (x^{s} )] - E_{q} [\phi (x^{t} )]} \right\|_{{H_{k} }}^{2} , $$
(3)
where \(H_{k}\) is an RKHS with a characteristic kernel k, \(E_{p} [ \bullet ]\) is the mean with respect to p, \(E_{q} [ \bullet ]\) is the mean with respect to q, and \(\phi ( \bullet )\) is a feature mapping function that maps features from the original feature space into the RKHS. In MK-MMD, we denote by K a particular family of kernels. Hence,
$$ K \triangleq \left\{ k = \sum\nolimits_{u = 1}^{m} {\beta_{u} k_{u} } :\;\sum\nolimits_{u = 1}^{m} {\beta_{u} = 1} ,\;\beta_{u} \ge 0,\;\forall u \in \{ 1,...,m\} \right\}, $$
(4)
where \(\{ k_{u} \}\) is a set of m positive definite kernel functions, and the constraints on the coefficients \(\{ \beta_{u} \}\) are imposed to guarantee that the derived multi-kernel k is characteristic.
Given \(x^{s}\) and \(x^{{s^{\prime}}}\) as independent random variables with distribution p, and \(x^{t}\) and \(x^{{t^{\prime}}}\) as independent random variables with distribution q, the characteristic kernel \(k( \bullet , \bullet )\) is defined as \(k(x^{s} ,x^{t} ) = \left\langle {\phi (x^{s} ),\phi (x^{t} )} \right\rangle\). Hence, the squared distance between the means of the probability distributions p and q can be computed as an expectation of kernel functions:
$$ d_{k}^{2} (p,q) = E_{{x^{s} ,x^{{s^{\prime}}} }} [k(x^{s} ,x^{{s^{\prime}}} )] - 2E_{{x^{s} ,x^{t} }} [k(x^{s} ,x^{t} )] + E_{{x^{t} ,x^{{t^{\prime}}} }} [k(x^{t} ,x^{{t^{\prime}}} )], $$
(5)
where \(x^{{s^{\prime}}}\) is an independent copy of \(x^{s}\) with the same distribution, \(x^{{t^{\prime}}}\) is an independent copy of \(x^{t}\) with the same distribution, and \(E_{{x^{s} ,x^{t} }} [ \bullet ]\) is the mean of \(k(x^{s} ,x^{t} )\); \(E_{{x^{s} ,x^{{s^{\prime}}} }} [ \bullet ]\) and \(E_{{x^{t} ,x^{{t^{\prime}}} }} [ \bullet ]\) are defined similarly.
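As a concrete illustration, the sketch below computes a simple (biased) empirical estimate of the squared MK-MMD of Eq. (5) on mini-batches of features, using a convex combination of Gaussian kernels as in Eq. (4). The kernel bandwidths and weights are assumptions for illustration, not values from the paper.

```python
import torch

def multi_kernel(x, y, gammas=(1.0, 2.0, 4.0), betas=None):
    """k(x, y) = sum_u beta_u * exp(-||x - y||^2 / gamma_u), evaluated pairwise."""
    betas = betas or [1.0 / len(gammas)] * len(gammas)   # beta_u >= 0, sum beta_u = 1
    d2 = torch.cdist(x, y) ** 2                          # pairwise squared distances
    return sum(b * torch.exp(-d2 / g) for b, g in zip(betas, gammas))

def mk_mmd2(xs, xt):
    """Empirical estimate of d_k^2(p, q) = E[k(s,s')] - 2E[k(s,t)] + E[k(t,t')]."""
    return (multi_kernel(xs, xs).mean()
            - 2.0 * multi_kernel(xs, xt).mean()
            + multi_kernel(xt, xt).mean())

xs = torch.randn(32, 256)   # FCL2 features from the source domain (toy data)
xt = torch.randn(32, 256)   # FCL2 features from the target domain (toy data)
print(mk_mmd2(xs, xt))
```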
The purpose of A-Net is to minimize the domain discrepancy between the source and target domains. The domain discrepancy can be measured by the distance between the means of the probability distributions from the source and target domains. Therefore, we have:
$$ \mathop {\min }\limits_{{D_{s} ,D_{t} }} D_{F} (D_{s} ,D_{t} ) = \mathop {\min }\limits_{p,q} d_{k}^{2} (p,q), $$
(6)
where \(D_{s}\) and \(D_{t}\) denote the source and target domains, respectively, and \(D_{F} ( \bullet )\) denotes the domain discrepancy between the source and target domains measured in the last fully connected layer of A-Net.
2.3 ML-ANet
The loss of ML-ANet consists of the MK-MMD loss and the multi-label classification loss. Minimizing the multi-label classification loss improves the discriminability of features in the source domain, while minimizing the MK-MMD loss reduces the discrepancy between the means of the probability distributions of the source and target domains. Thus, we derived the loss function of ML-ANet as:
$$ L = J(\theta ) + \lambda D_{F} , $$
(7)
where \(\lambda\) is a hyper-parameter that weights the MK-MMD loss, \(D_{F}\) denotes the MK-MMD loss, \(J(\theta )\) represents the multi-label classification loss on the source domain, and L is the total loss of ML-ANet. The whole network is trained with the mini-batch stochastic gradient descent (SGD) algorithm to minimize this loss on the training samples.
Mini-batch SGD is important for training deep networks effectively, but the naive calculation of MK-MMD requires pairwise kernel evaluations, which leads to a computational complexity of \(O(n^{2} )\) per mini-batch. To solve this problem, we used the unbiased empirical estimate of MK-MMD proposed by Gretton et al. [29], which can be computed with a complexity of \(O(n)\). With this unbiased empirical estimate, the squared form of MK-MMD is calculated as follows:
$$ d_{k}^{2} (p,q) = \frac{2}{{n_{s} }}\sum\nolimits_{i = 1}^{{n_{s} /2}} {g_{k} (z_{i} )} , $$
(8)
$$ \begin{gathered} g_{k} (z_{i} ) = k(x_{2i - 1}^{s} ,x_{2i}^{s} ) + k(x_{2i - 1}^{t} ,x_{2i}^{t} ) - \\ k(x_{2i - 1}^{s} ,x_{2i}^{t} ) - k(x_{2i}^{s} ,x_{2i - 1}^{t} ), \\ \end{gathered} $$
(9)
where \(n_{s}\) denotes the number of source samples \(x^{s}\), and \(z_{i}\) is the quad-tuple defined as \(z_{i} \triangleq (x_{2i - 1}^{s} ,x_{2i}^{s} ,x_{2i - 1}^{t} ,x_{2i}^{t} )\).
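A sketch of this linear-time estimator is given below (our own illustration; it assumes the same equally weighted Gaussian multi-kernel as the previous sketch). Samples are paired into quad-tuples \(z_{i}\) and Eqs. (8)-(9) are evaluated with only O(n) kernel computations.

```python
import torch

def pairwise_k(a, b, gammas=(1.0, 2.0, 4.0)):
    """Elementwise multi-Gaussian kernel k(a_i, b_i) for matched rows a_i, b_i."""
    d2 = ((a - b) ** 2).sum(dim=1)
    return sum(torch.exp(-d2 / g) for g in gammas) / len(gammas)

def mk_mmd2_linear(xs, xt):
    """Linear-time unbiased estimate of the squared MK-MMD, Eqs. (8)-(9)."""
    n = (min(len(xs), len(xt)) // 2) * 2          # use an even number of samples
    s_odd, s_even = xs[0:n:2], xs[1:n:2]          # x_{2i-1}^s, x_{2i}^s
    t_odd, t_even = xt[0:n:2], xt[1:n:2]          # x_{2i-1}^t, x_{2i}^t
    g = (pairwise_k(s_odd, s_even) + pairwise_k(t_odd, t_even)
         - pairwise_k(s_odd, t_even) - pairwise_k(s_even, t_odd))   # g_k(z_i), Eq. (9)
    return 2.0 * g.sum() / n                      # Eq. (8)

print(mk_mmd2_linear(torch.randn(32, 256), torch.randn(32, 256)))
```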
When we train a deep CNN by mini-batch SGD, we only need to consider the gradient of Eq. (8) with respect to each data point \(x_{i}\). To perform a mini-batch update, we computed the gradient of Eq. (7) with respect to the lth layer parameters \(\theta^{l}\) as:
$$ \nabla_{{\theta^{l} }} = \frac{{\partial J(z_{i}^{l} )}}{{\partial \theta^{l} }} + \lambda \frac{{\partial g_{k} (z_{i}^{l} )}}{{\partial \theta^{l} }}. $$
(10)
Given that the kernel k is a linear combination of multiple Gaussian kernels \(\{ k_{u} (x_{i} ,x_{j} ) = \exp ( - \left\| {x_{i} - x_{j} } \right\|^{2} /\gamma_{u} )\}\), the gradient \(\partial g_{k} (z_{i}^{l} )/\partial \theta^{l}\) can be calculated easily using the chain rule. For instance, the gradient of \(k(x_{2i - 1}^{sl} ,x_{2i}^{tl} )\) in \(g_{k} (z_{i}^{l} )\) can be calculated as:
$$ \begin{gathered} \frac{{\partial k(x_{2i - 1}^{sl} ,x_{2i}^{tl} )}}{{\partial W^{l} }} = - \sum\nolimits_{u = 1}^{m} {\frac{{2\beta_{u} }}{{\gamma_{u} }}k_{u} (x_{2i - 1}^{sl} ,x_{2i}^{tl} )} \times \hfill \\ (x_{2i - 1}^{sl} - x_{2i}^{tl} ) \times (x_{2i - 1}^{s(l - 1)} - x_{2i}^{t(l - 1)} )^{{\text{T}}} , \hfill \\ \end{gathered} $$
(11)
where \(x_{i}^{l} = W^{l} x_{i}^{l - 1} + b^{l}\), and \(W^{l}\) and \(b^{l}\) represent the coefficient matrix and the bias term from the (l−1)th layer to the lth layer, respectively. In summary, the training process of the entire ML-ANet approach is described in Algorithm 1.
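For illustration, a minimal single-iteration sketch of such a training step is shown below; it is an assumed structure, not the authors' Algorithm 1. It reuses the MLNet model (already materialized by a forward pass) and the mk_mmd2_linear estimator from the earlier sketches, and the learning rate, momentum, and \(\lambda\) value are assumptions rather than the paper's settings. Autograd supplies the chain-rule gradients of Eqs. (10) and (11).

```python
import torch
import torch.nn as nn

lam = 0.3     # lambda in Eq. (7); the value here is an assumption for illustration
bce = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)   # mini-batch SGD

def fcl2_features(images):
    """Shared backbone + FCL1 + FCL2; these features are embedded into the RKHS."""
    f = torch.flatten(model.features(images), 1)
    return torch.relu(model.fcl2(torch.relu(model.fcl1(f))))

def train_step(source_images, source_targets, target_images):
    optimizer.zero_grad()
    fs = fcl2_features(source_images)               # labeled source-domain batch
    ft = fcl2_features(target_images)               # unlabeled target-domain batch
    j_theta = bce(model.fcl3(fs), source_targets)   # multi-label BCE loss, Eq. (1)
    d_f = mk_mmd2_linear(fs, ft)                    # MK-MMD loss on FCL2 features, Eqs. (8)-(9)
    loss = j_theta + lam * d_f                      # total loss, Eq. (7)
    loss.backward()                                 # chain-rule gradients, Eqs. (10)-(11)
    optimizer.step()
    return loss.item()
```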