Opportunities and Challenges: Classification of Skin Disease Based on Deep Learning

Deep learning has become an extremely popular method in recent years, and can be a powerful tool in complex, prior-knowledge-required areas, especially in the field of biomedicine, which is now facing the problem of inadequate medical resources. The application of deep learning in disease diagnosis has become a new research topic in dermatology. This paper aims to provide a quick review of the classification of skin disease using deep learning to summarize the characteristics of skin lesions and the status of image technology. We study the characteristics of skin disease and review the research on skin disease classification using deep learning. We analyze these studies using datasets, data processing, classification models, and evaluation criteria. We summarize the development of this field, illustrate the key steps and influencing factors of dermatological diagnosis, and identify the challenges and opportunities at this stage. Our research confirms that a skin disease recognition method based on deep learning can be superior to professional dermatologists in specific scenarios and has broad research prospects.


Introduction
Skin diseases are common and cause suffering, sometimes with serious consequences, for millions of people globally [1]. Because skin diseases are complex, diverse, and often similar in appearance, they can be diagnosed reliably only by dermatologists with long-term clinical experience, and the diagnoses are rarely reproducible. An inexperienced dermatologist is likely to misdiagnose a condition, which can exacerbate it and impede appropriate treatment. Thus, it is necessary to provide a quick and reliable method to assist patients and dermatologists in data processing and judgment.
Advances in deep learning have influenced numerous scientific and industrial fields and have realized significant achievements with inspiration from the human nervous system. With the rapid development of deep learning in biomedical data processing, numerous specialists have adopted this technique to acquire more precise and accurate data. With the rapid increase in the amount of available biomedical data including images, medical records, and omics, deep learning has achieved considerable success in a number of medical image processing problems [2][3][4]. In this regard, deep learning is expected to influence the roles of image experts in biomedical diagnosis owing to its ability to perform quick and accurate assessments. This paper presents the characteristics of skin lesions, overviews image techniques, generalizes the developments in deep learning for skin disease classification, and discusses the limitations and direction of automatic diagnosis.
have focused attention on the requirement for healthy survival and social development. The high cost of treatment, repeated illness occurrences, and delays in treatment have brought challenges to the healthy survival and social development.
The accurate diagnosis of a particular skin disease can be a challenging task, mainly for the following reasons. First, there are numerous kinds of dermatoses, with nearly 3000 recorded in the literature. Stanford University has developed an algorithm to demonstrate generalizable classification with a new dermatologist-labeled dataset of 129450 clinical images divided into 2032 categories [9]. Figure 1 displays a subset of the full taxonomy; this has been organized clinically and visually by medical experts. Secondly, the complex manifestation of the disease is also a major challenge for doctors. Morphological differences in the appearance of skin lesions directly influence the diagnosis, mainly because there can be relatively poor contrast between different skin diseases, which cannot be distinguished without considerable experience. Finally, for different skin diseases, the lesions can be too similar to be distinguished using only visual information. Different diseases can have similar manifestations and the same disease can have different manifestations in different people, body parts, and disease periods [10]. Figure 2 displays sample images demonstrating the difficulty in distinguishing between malignant and benign lesions, which share several visual features. Unlike benign skin diseases, malignant diseases, if not treated promptly, can lead to serious consequences. Melanoma [11], for example, is one of the major and most fatal skin cancers. The five-year survival rate of melanoma can be greater than 98% if found in time; once the cancer has spread, this figure drops significantly to 17% [12]. In 2015, there were 3.1 million active cases, representing approximately 70% of skin cancer deaths worldwide [13,14]. The diagnosis of skin disease relies on clinical experience and visual perception. However, human visual diagnosis is subjective and lacks accuracy and repeatability, a limitation that computerized skin-image analysis systems do not share.
The use of these systems enables inexperienced operators to prescreen patients [15]. Compared with other diseases or applications such as industrial fault diagnosis, the visual manifestation of skin disease is more prominent, facilitating the significant value of deep learning in image recognition with visual sensitivity. Through the study of large detailed images, dermatology can become one of the most suitable medical fields for telemedicine and artificial intelligence (AI). Using imaging methods, it could be possible for deep learning to assist or even replace dermatologists in the diagnosis of skin disease in the near future.

Image Methods
Deep learning is a class of machine learning that automatically learns hierarchical features of data using multiple layers composed of simple and nonlinear modules. It transforms the data into representations that are important for discriminating the data [16]. As early as 1998, the LeNet network was proposed for handwritten digit recognition [17]. However, owing to the lack of computational power, it was difficult to support the required computation. It was not until 2012 that this method was successfully applied, when it overwhelmingly outperformed previous machine learning methods for visual recognition tasks in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [18,19]. This was a breakthrough that used convolutional networks to virtually halve the error rate for object recognition, and precipitated the rapid adoption of deep learning by the computer vision community [16]. Since then, deep learning algorithms have undergone considerable development because of the improved capabilities of hardware such as graphics processing units (GPUs). Different models, such as ZFNet [20], VGG [21], GoogLeNet [22], and   Figure 3); correspondingly, that of humans was approximately 5%. Deep learning has dramatically improved tasks in different scientific and industrial fields including not only computer vision but also speech recognition, drug discovery, clinical surgery, and bioinformatics [24][25][26].
The structure of a convolutional neural network (CNN), which is a representative deep learning algorithm, is displayed in Figure 4. Actual models resemble this figure but have deeper layers and more convolution kernels. A CNN is a type of "feedforward neural network" inspired by human visual perception mechanisms, and can learn a large number of mappings between inputs and outputs without any precise mathematical expression between them. The first convolutional filter of the CNN is used to detect low-order features such as edges, angles, and curves. As the number of convolutional layers increases, the detected features become more complex [20]. The pooling layer, also called the subsampling layer, converts a window into a single pixel by taking the maximum or average value [27], which reduces the size of the feature map. After the image passes the last fully connected layer, the model maps the learned distributed features to the sample label space and provides the final classification type. The layout of the CNN is similar to a biological neural network, with sparse structures and shared weights, which can reduce the number of parameters and help prevent overfitting. Deep CNNs demonstrate the potential for a variety of tasks across numerous fine-grained object categories and have unique advantages in the field of image recognition. The selection of a suitable model is crucial. The GoogLeNet model, with a structure called inception (Figure 5), was proposed to maintain the sparsity of the network structure while also using the high computational performance of dense matrices [22]. GoogLeNet has been studied and used by numerous researchers because of its excellent performance. The Google team has therefore further explored and improved it, resulting in an upgraded version of GoogLeNet, Inception v3 [28], which has become the first choice for current research.
With Google's Inception v3 CNN architecture pretrained to a high-level accuracy on the 1000 object class of ImageNet, researchers can remove the final classification layer from the network, retrain it with their own dataset, and fine-tune the parameters across all the layers.
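The convolution and pooling operations described above can be illustrated with a minimal, standard-library sketch. It is not taken from any of the reviewed models: the toy 6×6 image, the hand-written vertical-edge kernel, and the function names `conv2d_valid` and `pool2d` are assumptions chosen for illustration, and a real CNN learns its kernels from data rather than using fixed ones.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the response map.

    As in most deep learning frameworks, the kernel is not flipped,
    so this is strictly a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def pool2d(feature_map, size=2, mode="max"):
    """Reduce each non-overlapping size x size window to its max or average."""
    reducer = max if mode == "max" else (lambda window: sum(window) / len(window))
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            reducer([feature_map[r * size + i][c * size + j]
                     for i in range(size) for j in range(size)])
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Toy 6x6 image: dark left half (0) and bright right half (1), i.e. one vertical edge.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Hand-written vertical-edge kernel (an illustrative stand-in for a learned filter).
edge_kernel = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

response = conv2d_valid(image, edge_kernel)      # 4x4 map, large where the edge lies
pooled_max = pool2d(response, size=2, mode="max")
pooled_avg = pool2d(response, size=2, mode="avg")
```

The response map is nonzero only in the columns that straddle the edge, and pooling halves each spatial dimension, which is the size reduction the text attributes to the subsampling layer.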
Google's TensorFlow [29], Caffe [30], and Theano [31] deep learning frameworks can be used for training. Theano is a Python library and optimizing compiler for manipulating and evaluating mathematical expressions. It pioneered the trend of using symbolic graphs for programming a network; however, it lacks a low-level interface and the inefficiency of the Python interpreter limits its usage. Caffe's ConvNet implementation, with numerous extensions being actively added, is excellent; however, its support for recurrent networks and language modeling in general is poor. If both CPU and GPU support is required, additional functions must be implemented. Specifying a new network is fairly easy in TensorFlow using a symbolic graph of vector operations; however, it has a major weakness in terms of modeling flexibility. It has a clean, modular architecture with multiple frontends and execution platforms, and the library can be compiled on Advanced RISC Machines (ARM).
Deep learning has been gradually applied to medical image data, as medical image analysis approaches are considerably similar to computer vision techniques [32]. Although numerous studies were initially undertaken using relatively small datasets and a pretrained deep learning model as a feasibility study, a robust validation of the medical application is required [33][34][35]. Hence, big data from medical images have been collected to validate the feasibility of medical applications [9,36]. For example, Google researchers collected large datasets consisting of more than 120,000 retinal fundus images for diagnosing diabetic retinopathy and demonstrated high sensitivity and specificity for detection [37].
Owing to the development of hardware and advancement of the algorithms, deep learning now includes considerably more functionality than could previously be imagined. Researchers are now more likely to be able to predict and distinguish conditions that are difficult to diagnose owing to complex mechanisms and similar characterizations [38,39]. Deep learning is a powerful machine learning algorithm for classification while extracting low- to high-level features [40,41]. A key difference in deep learning compared to other diagnostic methods is its self-learning nature. The neural network is not designed by humans; rather, it is designed by the data itself. Table 1 presents several published achievements on disease diagnosis using pictures or clinical images, which indicate that deep learning can be comparable to professional specialists in certain fields. Furthermore, many researchers have indicated interest in mobile diagnostics that allow the use of mobile technology. Smartphones, with sufficient computing power and rapidly developing versatility and utility, could be used to scan, calculate, and analyze anytime and anywhere to detect skin disease [42][43][44]. Researchers have developed such a system based on AI that allows users to install apps on their smartphones and analyze and judge suspicious lesions on the body by taking a picture [45].

Skin Disease Classification Using Deep Learning
Using the deep learning technique, the pattern recognition of images can be performed automatically once the program is established. Images can be input to a CNN with high fidelity and important features can be automatically obtained. Therefore, information extraction from images prior to the learning process is not necessary with this technique. In shallow layers, simple features such as the edges within the images are learned. At deep layers near the output layer, more complex high-order features are learned [56]. Different researchers, institutions, and challenges are working on the automatic diagnosis of skin disease, and different deep learning methods have been developed for the recognition of dermatological disease; these have been proven to be effective in numerous fields [57]. For example, the International Skin Imaging Collaboration (ISIC) is a challenge that focuses on the automatic analysis of skin lesions. The goal of the challenge (started in 2017) is to support the research and development of algorithms for the automated diagnosis of melanoma including lesion segmentation, dermoscopic feature detection within a lesion, and classification of melanoma [58,59], which is also the main goal in the field of dermatology [60]. In general, this method is a modeling framework that can learn the functional mapping from the input images to output. The input image is a preprocessed image; the output image is a segmentation mask. The network structure involves a series of convolution and pooling layers, followed by a fully connected layer, followed by a series of unpooling and disconnection operations [61]. The diagnosis of skin diseases typically consists of four components: image acquisition, image preprocessing, feature extraction and classification, and evaluation of the criteria. 
Image acquisition is the basis for skin classification, and more images typically indicate greater accuracy and better adaptability (for the data size of selected projects, please refer to Table 2). Preprocessing is used to crop and zoom the images and segment lesions for better training. Feature extraction mainly acquires the features of the skin lesions through color, texture, and boundary information. The evaluation of the results is the final step, which is used to judge whether the classification model is reasonable and achieves its objective.

Image Acquisition
Deep learning requires a large number of images to extract disease features. These datasets are typically available from the Internet, open dermatology databases, and hospitals in collaboration with research units, and are labeled by professional dermatologists after removing blurry and distant images. An excellent dataset should be composed of dermoscopic images. Dermoscopy is a noninvasive skin imaging technology that can observe the skin structure at the junction of the epidermis and dermis, and clearly indicate the nature, distribution, arrangement, edge, and shape of pigmented skin lesions. Because of the uncertainty of imaging conditions, such as shooting angle, illumination, and storage pixels, the imaging  Table 3 covering more than a dozen kinds of skin diseases, among which melanoma has the greatest probability of occurrence. However, owing to the lack of a unified standard for skin disease images, the labeling of images is time-consuming and labor-intensive, which significantly limits the size of the current public datasets. Therefore, numerous studies have combined multiple datasets for use [43,63].

Image Preprocessing
High image quality can improve the generalization ability of a model. Preprocessing can reduce irrelevant information in the image, improve the intensity of the relevant information, simplify the data, and improve the reliability. The general image preprocessing process is as follows:
(1) Image segmentation. Skin lesion segmentation is the essential step for the majority of classification tasks. Accurate segmentation contributes to the accuracy, computation time, and error rate of subsequent lesion classification [71,72]. It is crucial for image analysis for the following two reasons. First, the border of a lesion provides important information for accurate diagnosis, including numerous clinical features such as asymmetry and border irregularity. Secondly, the extraction of other important clinical features such as atypical dots and color variegation critically depends on the accuracy of the border detection [8,73]. Given an input dermoscopic image (Figure 6a), the goal of the segmentation process is to generate a two-dimensional mask (Figure 6b) that provides an accurate separation between the lesion area and surrounding healthy skin [74].
(2) Resize. Lesions frequently occupy a relatively small area, although skin images can be considerably large [75,76]. Images should be preprocessed before being passed to a deep learning network because the resolution of the original lesion images is typically too large, which entails a high computation cost [77]. Accurate skin lesion segmentation enhances its capability by incorporating a multiscale contextual information integration scheme [62]. To avoid distorting the shape of the skin lesion, the images should be cropped to the center area first and then proportionally resized. Images are frequently resized to 224×224 or 227×227 pixels through scaling and clipping [78], which is an appropriate size given the trade-off between the amount of calculation and information density.
(3) Normalization. Normalization is essentially a linear transformation that does not invalidate the data after changing it. On the contrary, it can improve the performance of the data, accelerate the solution speed of gradient descent, and enhance the convergence speed of the model.
(4) Data augmentation. Owing to privacy and professional equipment problems, it is difficult to collect sufficient data in the process of skin disease identification. A dataset that is too small can easily lead to overfitting, which leaves the network model lacking generalization ability. Data augmentation, using operations such as rotation, random cropping, and added noise, is adopted to expand the dataset to meet the data requirements of deep learning [79]. Figure 7 displays several methods of image processing by which the image database can be extended to meet the training requirements.
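The resizing, normalization, and augmentation steps listed above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the pipeline of any reviewed study: images are nested lists of grayscale values, the function names are hypothetical, resizing uses nearest-neighbour sampling, normalization is min-max scaling (one possible linear transformation), and the toy sizes stand in for the 224×224 target mentioned in the text. Segmentation is not covered.

```python
import random


def center_crop(img):
    """Crop the largest centered square so later resizing keeps the lesion's proportions."""
    h, w = len(img), len(img[0])
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    return [row[left:left + side] for row in img[top:top + side]]


def resize_nearest(img, size):
    """Nearest-neighbour resize to size x size (use 224 for a typical CNN input)."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def normalize(img):
    """Min-max normalization: a linear map of the pixel values onto [0, 1]."""
    flat = [p for row in img for p in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1  # constant image: avoid dividing by zero
    return [[(p - lo) / scale for p in row] for row in img]


def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def random_crop(img, size, rng):
    """Cut a size x size patch at a random position."""
    top = rng.randrange(len(img) - size + 1)
    left = rng.randrange(len(img[0]) - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]


def add_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in img]


# Toy 4x6 grayscale "image" standing in for a large dermoscopic photo.
raw = [[c + 6 * r for c in range(6)] for r in range(4)]
square = center_crop(raw)            # 4x4, no aspect-ratio distortion

rng = random.Random(0)               # fixed seed so the sketch is repeatable
variants = [
    square,                                           # original
    rotate90(square),                                 # rotation
    resize_nearest(random_crop(square, 3, rng), 4),   # random crop, scaled back up
    add_noise(square, 0.5, rng),                      # additive noise
]

# Resize every variant to the (toy) network input size, then normalize.
batch = [normalize(resize_nearest(v, 2)) for v in variants]
```

Cropping before resizing is what keeps the lesion shape undistorted, and each augmented variant is a new training sample that carries the same label as its source image.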

Feature Extraction and Classification
Early detection of lesions is a crucial step in the field of skin cancer treatment. There is a significant benefit if this can be achieved without penetrating the body. Feature extraction of skin disease is an important tool that can be used to properly analyze and explore an image [80]. Feature extraction can be simply viewed as a dimensionality reduction process; that is, converting picture data into a vector of a certain dimension that captures the picture features. Before deep learning, the features were typically determined manually by dermatologists or researchers after investigating a large number of digital skin lesion images. A well-known method for feature extraction is based on the ABCD rule of dermoscopy. ABCD stands for asymmetry, border structure, color variation, and lesion diameter. It defines the basis for disease diagnosis [81]. Traits such as color, texture, and Histogram of Oriented Gradient (HOG) are extracted and subsequently fused with a serial-based method. The fused features are then selected by implementing a novel Boltzmann entropy method [82], which can be used for early detection. However, this typically has enormous randomness and depends on the quantity and quality of the pictures, as well as the experience of the dermatologists. From a classification perspective, feature extraction has numerous benefits: (i) reducing classifier complexity for better generalization, (ii) improving prediction accuracy, (iii) reducing training and testing time, and (iv) enhancing the understanding and visualization of the data. The mechanism of neural networks is considerably different from that of traditional methods. Visualization indicates that the first layers are essentially calculating edge gradients and other simple operations similar to those used in SIFT [83] and HOG [84]. The deeper convolutional layers combine the local patterns into a more global pattern, ultimately resulting in a more powerful feature extractor.
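To make the view of feature extraction as dimensionality reduction concrete, the following sketch converts an arbitrary number of RGB pixels into a fixed-length color histogram. It is a hypothetical, hand-crafted color feature chosen for illustration only; it is not the ABCD rule, the HOG descriptor, or the method of any cited study, and the function name and bin count are assumptions.

```python
def color_histogram(pixels, bins=4):
    """Map a non-empty list of (r, g, b) pixels (values 0-255) to a vector of 3 * bins numbers.

    Each channel is split into `bins` equal intensity ranges, and the vector
    holds the fraction of pixels falling into each range, channel by channel.
    """
    width = 256 // bins
    hist = [0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # min() keeps the top intensities in the last bin when bins does not divide 256.
            hist[channel * bins + min(value // width, bins - 1)] += 1
    total = len(pixels)
    return [count / total for count in hist]


# Four toy pixels; a real lesion image would contribute thousands.
lesion_pixels = [(255, 0, 0), (200, 10, 0), (0, 0, 0), (130, 70, 64)]
features = color_histogram(lesion_pixels, bins=4)
```

Whatever the image size, the output always has `3 * bins` entries, so it can be passed to a conventional classifier. A CNN replaces this fixed, hand-designed mapping with features learned from the training data.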
In a study using nearly 130000 clinical dermatology images, a single CNN trained end-to-end directly from pixels and image labels was tested against 21 certified dermatologists on skin lesion classification; it had an accuracy of 0.96 for carcinoma [9]. Subsequently, researchers used deep learning to develop an automated classification system for 12 skin disorders by learning the abnormal characteristics of a malignancy and determined visual explanations from the deep network [47]. A third study combined deep learning with traditional methods such as hand-coded feature extraction and sparse coding to create an ensemble for melanoma detection that could yield higher performance than expert dermatologists. These results and others [85][86][87] confirm that deep learning has significant potential to reduce doctors' repetitive work. Despite problems, it would be a significant advance if AI could reliably simulate experienced dermatologists.

Evaluation Criteria and Benchmarking
Evaluation criteria are vital in this field and are typically based on three points: reliability, time consumption, and training and validation [88]. Some researchers [73,89,90] have used all three criteria to develop and design methods and techniques for detecting and diagnosing skin disease. Others [71,91,92] have used only two criteria, reliability and training and validation, to evaluate and discuss the different types of classifiers.
Numerous studies have demonstrated that acceptable reliability, time complexity, and error rates within a dataset cannot be achieved at the same time; hence, researchers must establish different standards. Once one of them is selected, the performance of the others diminishes [90,93]. Consequently, conflicts among dermatological evaluation criteria pose a serious challenge to dermatological classification methods. These requirements must be considered during the evaluation and benchmarking. The dermatological classification method should standardize the requirements and objectives and use a programmatic process in research, evaluation, and benchmarking. Moreover, new flexible evaluations should address all conflicting standards and issues [94]. Despite the conflicts, important criteria are the key goals for evaluation and benchmarking. It is necessary to develop appropriate procedures for these goals while increasing the importance of specific evaluation criteria and decreasing other standards [95]. When evaluating the results obtained using the diagnostic model, researchers must consider the quality of the dataset used to build the model and choose the parameters that can adjust that model. The time complexity and error rate in the dataset have proven to be important in the field of dermatology, which, with more consideration during the evaluation process, can optimize the consistency of the results [63]. In general, the goal is to obtain a balanced classifier for sensitivity and specificity.
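The balance between sensitivity and specificity mentioned above can be computed directly from a classifier's predictions. The sketch below assumes a binary task in which label 1 means malignant and 0 means benign; the labels and predictions are invented toy values, not results from any study, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, true negatives, and false negatives."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Invented toy labels: 4 malignant (1) and 6 benign (0) lesions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

counts = confusion_counts(y_true, y_pred)
sens, spec = sensitivity_specificity(y_true, y_pred)
```

Here one missed malignant lesion lowers sensitivity to 3/4, and two benign lesions flagged as malignant lower specificity to 4/6. Moving the decision threshold typically trades one measure against the other, which is why a balanced classifier is the stated goal.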

Limitations
In general, the advantage of AI is that it can help doctors perform tedious repetitive tasks. For example, if sufficient blood is scanned, an AI-powered microscope can detect low-density infections in micrographs of standard, field-prepared thick blood films, which is considered to be time-consuming, difficult, and tedious owing to the low density and small parasite size and abundance of similar non-parasite objects [49]. The requirement for staff training and purchase of expensive equipment for creating dermoscopic images can be replaced by software using CNNs [96]. In the future, the clinical application of deep learning for the diagnosis of other diseases can be investigated. Transfer learning could be useful in developing CNN models for relatively rare diseases. Models could also evolve such that they require fewer preprocessing steps. In addition to these topics, a deeper understanding of the reconstruction kernel or image thickness could lead to improved deep learning model performance. Positive effects should continue to grow owing to the emergence of higher precision scanners and image reconstruction techniques [56]. However, we must realize that although AI has the ability to defeat humans in several specific fields, in general the performance of AI is considerably less than acceptable in the majority of cases [97]. The main reasons for this are as follows.
(1) Medicine is an area that is not yet fully understood. Information is not completely transparent. The characteristics of dermatology determine that the majority of the data cannot be obtained. At the same time, the AI technology route is immature, the identification accuracy of which must be improved owing to the uncertainty of manual diagnosis. There is no strict correspondence between the symptoms and results of a disease and no clear boundary between the different diseases. Thus, the use of deep learning for disease diagnosis continues to require considerable effort.
(2) Before systematic debugging, extensive simulation, and robust validation, flawed algorithms could harm patients, which could lead to medical ethical issues, and therefore require forward-looking scrutiny and stricter regulation [98]. As a "black box", the principle of deep learning is unexplained at this stage, which could result in unpredictable system output. Moreover, it is possible that humans cannot truly understand how a machine functions, even though it is actually inspired by humans [99,100]. Hence, whether patient care based on an opaque algorithm is acceptable remains a point of discussion.
(3) There is a problem with the change in the error rate value in a dataset, which is caused by the change in the size of the dataset used in different skin cancer experiments. Therefore, the lack of a standard dataset can lead to serious problems; the error rate values are considered in many experiments. In addition, the collection of datasets for numerous studies depends on individual research, leading to unnecessary effort and time. When the actual class is manually marked and compared to the predicted class to calculate one of the parameter matrices, pixels are lost when the background is cut from the skin cancer image using Adobe Photoshop [101]. At this point, the process influences the results of all the parameter reliability groups (matrices, relationships, and behaviors), which are considered controversial. High reliability and low rate of time complexity cannot be achieved simultaneously, which is reflected in the training process and is influenced by conflicts between different standards, leading to considerable challenges [93,102]. A method that works for the detection of one skin lesion could possibly not work for the detection of others [103].
Numerous different training and test sets have been used to evaluate the proposed methods. Moreover, for the parameters in the training and evaluation, different researchers are interested in different parts. This lack of uniformity and standardization across all papers makes a fair comparison virtually impossible [50,104]. Although these indicators in the literature have been widely criticized, studies continue to use them to evaluate the application to skin cancer and other image processing fields.
(4) The data used for evaluation are frequently too small to allow a convincing statement regarding a system's performance to be made. Although it is not impossible to collect an abundance of relevant data through the Internet in this information age, this information, with significant uncertainty, apparently cannot meet the requirements of independent and identical distribution, which is one of the important prerequisites for deep learning to be successfully applied. For certain rare diseases and minorities, only a limited number of images are available for training. To date, a large number of algorithms have demonstrated prejudice against minority groups, which could cause a greater gap in health service between the "haves" and the "have-nots" [105]. Numerous cases are required for the training process using deep learning techniques. In addition, although the deep learning technique has been successfully applied to other tasks, the developed models in skin are valid in only specific dedicated diseases and are not applicable to common situations. Diagnosing skin disease is a complex process that, in addition to image recognition, must be supplemented by other means such as palpation, smell, temperature change, and microscopy.

Prospect
Deep learning has made considerable progress in the field of skin disease recognition. More attempts and explorations in the future can be considered in the following aspects.
(1) Establishment of standardized skin disease image dataset. A large amount of data is the basis of skin disease recognition and the premise of acceptable generalization ability of the network model. However, the number of images, disease types, image size, and shooting and processing methods of the published datasets are considerably different, which leads to confusion between different studies and the loss of the ability to quantitatively compare different models. Moreover, it is difficult to collect images of certain rare diseases. As mentioned above, there are numerous kinds of skin diseases; however, only approximately 20 datasets are available, including fewer than 20 kinds of skin diseases. There is an urgent requirement to expand access to medical images. For example, Indian researchers have trained neural networks to analyze images from "handheld imaging devices" instead of stationary dermatoscope devices to provide more prospects for early and correct diagnosis [106]. However, a public database that allows the collection of a sufficient number of labeled datasets is likely necessary to truly represent projections of the population.
(2) Interpretability of skin disease recognition. The progress of deep learning in skin disease recognition depends on a highly nonlinear model and parameter adjustment technology. However, the majority of the neural networks are "black box" models, and their internal decision-making process is difficult to understand. This "end-to-end" decision-making mode leads to the weak explanatory power of deep learning. The internal logic of deep learning is not clear, which makes the diagnosis results of the model less convincing. The interpretability research of skin disease classification could allow the owner of the system to clearly know the behavior and boundary of the system, and ensure the reliability and safety of the system.
Moreover, it could monitor the moral problems and violations caused by training data deviation and provide a superior mechanism to follow the requirements within an organization to solve the bias and audit problems caused by AI [107].
(3) Intelligent diagnosis and treatment of skin diseases. Deep learning can be used to address the increasing number of patients with skin disease and relieve the pressure on the limited number of dermatologists. With the popularity of mobile phones, mobile computers, and wearable devices, a skin disease recognition system based on deep learning can be expected to be available on intelligent devices to serve more people. Using a mobile device camera, users can upload their own photos of the affected area to the cloud recognition system and download the diagnosis results at any time. Through simple communication with the "skin manager", diagnosis suggestions and possible treatment methods could be available. Furthermore, the "skin manager" could monitor the user's skin condition and provide real-time protection methods and treatment suggestions. Computer diagnostic systems can assist trained dermatologists rather than replace them. These systems can also be useful for untrained general practitioners or telemedicine clinics. For health systems, improving workflows could increase efficiency and reduce medical errors. Hospitals could make use of large-scale data and recommend data sharing with a cloud-based platform, thus facilitating multihospital collaboration [108]. For patients in remote and less modernized areas, telemedicine, or the ability to process their own data, should make the medical resources of the top hospitals in big cities accessible [109].

Conclusions
The potential benefits of deep learning solutions for skin disease are tremendous and there is an unparalleled advantage in reducing the repetitive work of dermatologists and pressure on medical resources. Accurate detection is a tedious task that inevitably increases the demand for a reliable automated detection process that can be adopted routinely in the diagnostic process by expert and non-expert clinicians. Deep learning is a comprehensive subject that requires a wide range of knowledge in engineering, information, computer science, and medicine. With the continuous development of the above fields, deep learning is undergoing rapid development and has attracted the attention of numerous countries. Powered by more affordable solutions, software that can quickly collect and meaningfully process massive data, and hardware that can accomplish what people cannot, it is evident that deep learning for