Construction of Human Digital Twin Model Based on Multimodal Data and Its Application in Locomotion Mode Identification

With the increasing attention to the state and role of people in intelligent manufacturing, there is a strong demand for human-cyber-physical systems (HCPS) that focus on human-robot interaction. Existing intelligent manufacturing systems cannot support efficient human-robot collaborative work. Moreover, unlike machines equipped with sensors, human characteristic information is difficult to perceive and digitize instantly. In view of the high complexity and uncertainty of the human body, this paper proposes a framework for building a human digital twin (HDT) model based on multimodal data and expounds on the key technologies. A data acquisition system is built to dynamically acquire and update the body state data and physiological data of the human body and to realize the digital expression of multi-source heterogeneous human body information. A network based on bidirectional long short-term memory and a convolutional neural network (BiLSTM-CNN) is devised to fuse multimodal human data and extract spatiotemporal features, with human locomotion mode identification taken as an application case. A series of optimization experiments are carried out to improve the performance of the proposed BiLSTM-CNN-based network model, which is then compared with traditional locomotion mode identification models. The experimental results demonstrate the superiority of the HDT framework for human locomotion mode identification.


Introduction
With the advancement of sensor technology, communication technology, and automation control technology [1][2][3], the traditional manufacturing industry has been transforming and upgrading to intelligent manufacturing, and its level of automation and intelligence has been continuously improved. Countries have also launched corresponding initiatives, such as "Made in China 2025" [4] and "German Industry 4.0" [5]. Cyber-physical systems (CPS) are multidisciplinary systems with collaborative computing capabilities that are closely coupled with the surrounding physical systems. They integrate computing, control, and communication technologies for feedback control of distributed computing systems [6][7][8].
The cyber-physical production system (CPPS) is the key paradigm of CPS in the manufacturing process, which promotes the continuous improvement of the automation level of the manufacturing system. Despite this, the important role of people in manufacturing has been ignored to a certain extent. Humans are the most active and dynamic element in manufacturing. Lack of attention to human beings in the traditional manufacturing process leads to potential safety hazards and reduces the efficiency of human-robot collaboration (HRC). Therefore, both academia and industry are exploring how to improve the potential of HRC in smart manufacturing. Digital twin (DT) technology, as an emerging paradigm, offers new perspectives on the full lifecycle monitoring of machines in smart manufacturing. However, unlike machines, humans are not equipped with sensors and are difficult to represent digitally. This poses a challenge to the deeper integration of humans and machines. Zhou et al. [9] proposed the framework of human-cyber-physical systems (HCPS) and discussed the intelligent manufacturing architecture for the new generation of HCPS. On this basis, Wang et al. [10,11] expounded the relationship between humans, cyber systems, and physical systems under the HCPS and the possibility of their fusion.
To achieve efficient, stable, and safe collaborative work between humans and robots, it is important to enhance the interaction and integration between humans and CPS, and to realize the perception and acquisition of human feature information. Describing the human body in digital form makes the manufacturing process human-centric. The human digital twin (HDT) is an important driver of the HCPS [12]. At present, the construction and application of HDT mainly focus on monitoring the health status and physiology of the human body. Song et al. [13] proposed a digital twin framework for the mechanical properties of the human skeleton in the manufacturing process, and established a digital twin model of the human lumbar spine to monitor the lumbar health of workers, which can promote harmonious collaboration between humans and robots. Makrini et al. [14] established an ergonomic optimization strategy for posture optimization based on virtual elements; the system is monitored by the operator and can analyze the operator's postures and assess the associated musculoskeletal disorder (MSD) risk. In order to monitor the physical state of workers under non-invasive conditions, a fatigue detection system based on the rating of perceived exertion (RPE) was constructed [15]. In this system, neural networks and the discrete wavelet transform are used to classify fatigue changes and extract features, which can effectively improve the accuracy of the system. Human cognition has also been applied to decision-making in CPS. Franceschi et al. [16] developed a production system framework based on multi-agent management and DT monitoring. The operator's cognitive ability is used for supervision and timely intervention in the event of failure, which can effectively improve the fault tolerance of the system. In the process of atmospheric plasma spraying, the dynamic changes of the eyes can also reflect the cognitive changes of the operator. Using eye trackers to enable cognitive assessment of complex processes can integrate operator cognitive abilities and expertise into the atmospheric plasma spray process [17]. Zhang et al. [18] combined a pressure array sensor and an infrared array sensor and proposed a deep learning-based human sitting posture recognition algorithm that protects privacy and security and can identify ten human sitting postures. To address musculoskeletal diseases caused by the sedentary and fixed postures of office workers, an ergonomics-based automatic posture assessment method using a depth camera sensor was proposed [19]. Current research mainly focuses on associating certain human behaviors and states with manufacturing systems and applying them, and lacks a theoretical system for digitizing the human body.
Moreover, there is little research on the combination of HDT and human locomotion mode identification. In the study of human locomotion mode identification, sensors are usually used to obtain the characteristic information of the human body. Since sensor data is time series data, the performance of the processing method affects subsequent recognition tasks. The Hidden Markov Model (HMM) algorithm [20] can effectively deal with time series data, but it has disadvantages in dealing with long-term dependency problems. While the dynamic time warping (DTW) algorithm [21,22] has a simpler and more effective structure than the HMM algorithm, it is more sensitive to noise. With the rapid development of computer hardware in recent years, deep learning algorithms based on neural networks have provided a new solution to the above problems [23][24][25]. Lee et al. [26] combined long short-term memory (LSTM) and bidirectional long short-term memory (BiLSTM) to build a novel network model, which can enhance the accuracy of human gait phase estimation. A gait recognition algorithm based on LSTM and a convolutional neural network (CNN) was proposed for the control of lower-limb exoskeleton robots [27]. Chen et al. [28] designed different sub-network architectures for different sensors to extract the features for sensor fusion and then used LSTM to process the time series data of the sensors, thereby improving the accuracy of human activity recognition. These studies mainly leverage one or two modalities of human locomotion information and improve the overall architecture of the neural network to raise the identification accuracy. However, because the modal information is limited, these methods cannot describe human characteristic information completely, and their network structures cannot extract human characteristic information accurately and efficiently.
To this end, this paper proposes the construction architecture of the HDT model based on multimodal data for HCPS. By building a data acquisition system to perceive and acquire human body posture signals and physiological signals in the manufacturing process, an HDT model is constructed to describe the human operator in digital form, so as to enhance the interaction between humans and CPS. Under the proposed HDT model architecture, we develop a novel multimodal data fusion method based on BiLSTM-CNN for human locomotion identification, which could facilitate real-time monitoring and identification of the human locomotion mode. The efficiency and accuracy of the human locomotion mode identification are verified.
The remainder of this paper is structured as follows. Section 2 proposes the construction architecture of the HDT model based on multimodal data for HCPS. Section 3 introduces the key technologies and implementation process under the proposed architecture. Section 4 presents the specific application of the proposed HDT model framework in human locomotion mode identification. A series of optimization experiments are also carried out to improve the performance of the proposed BiLSTM-CNN network model, and the proposed BiLSTM-CNN model is compared with traditional locomotion mode identification models, which verifies the efficiency and accuracy of the architecture proposed in Section 2. The conclusions and further work of this research are provided in Section 5.

Human Digital Twin Model Architecture
The new generation of intelligent manufacturing systems emphasizes the importance of humans in the manufacturing process and puts forward higher requirements for human-machine interaction. In the traditional intelligent manufacturing process, humans are participants independent of the CPS and cannot be integrated with it. HCPS describes humans in the form of data to promote the fusion of humans and CPS, which is of great significance for the transformation and upgrading of the intelligent manufacturing system.
Due to the high complexity and uncertainty of the human body, it is necessary to integrate technologies such as human state perception, locomotion mode identification, feature parameter extraction, digital model construction, and deep learning algorithms to construct an efficient and accurate HDT model. This paper proposes a framework for the construction of an HDT model based on multimodal data, as shown in Figure 1. The HDT model framework includes four basic layers, namely the physical entity layer, cyber model layer, data layer, and application layer. Through the effective integration of intelligent sensor perception and processing technology, a human digital model is constructed based on multimodal data, and an intelligent identification model based on a deep neural network is realized. The dynamic update and real-time analysis of the HDT are also supported. These layers are closely related, and the function of each layer is described in detail next.

Physical Entity Layer
The physical entity layer includes the human body, intelligent sensors, and specific manufacturing environment elements. The human body information collected by intelligent sensors serves as the basis of the model. In an actual manufacturing workshop, digitizing human body information must not disturb the normal production of workers, so cumbersome external equipment that would place an extra burden on workers should be avoided. Because a single sensor cannot comprehensively describe the multidimensional feature state of the human body, multiple portable, lightweight smart sensors should be used to accurately characterize human operators. Using a variety of sensors to obtain multimodal information makes it possible to acquire the characteristic locomotion information of the human body. The intelligent sensors mainly include inertial measurement units (IMU), plantar pressure sensors, a depth camera, and electromyography (EMG) equipment. These sensors are lightweight and portable, and they enhance the perception of human body status without interfering with the normal production of workers. Through the use of intelligent sensors, the multi-source heterogeneous data of the human body can be collected and analyzed in real time, realizing the perception and acquisition of the human body state and physiological signals in the manufacturing process.
The physical entity layer is the hardware foundation of the HDT model framework. It transmits the perceived and acquired body state and physiological signals to the cyber model layer and the data layer, promoting the construction of the HDT model.

Cyber Model Layer
The cyber model layer includes a database and a human digital model. The database contains the human body's real-time state data, historical data, and abnormal data. The key features of the human body multimodal data are chosen as the feature parameters of the human digital model, which include the position of human skeleton nodes, EMG signals, body acceleration, and plantar pressure distribution. In this way, the human digital model is established and the human information in physical space is expressed digitally. This multimodal feature information can describe the changes in the human body state over time at different granularities and different spatial scales, which enhances the immersion and authenticity of the human digital model.
The cyber model layer is the underlying support of the HDT model framework. It obtains the current and historical multimodal data of the human body from the data layer, and then establishes a dynamically updated, high-fidelity human digital model. The established human digital model serves as the input basis for the intelligent identification model in the data layer.

Application Layer
The application layer includes a human state monitoring system and locomotion mode identification. It provides a series of human state monitoring and locomotion mode identification services according to the data and analysis results transmitted from the data layer. The human state monitoring system is a web-based real-time condition monitoring platform. It can be customized to generate different solutions according to the different human condition monitoring needs in the manufacturing workshop. The human state monitoring system mainly displays human acceleration information, plantar pressure distribution, the position of human skeleton nodes, and EMG signals. Since the system is presented through a cloud platform, it can be deployed synchronously on different devices, such as mobile phones and computers. Human locomotion mode identification uses the intelligent identification model in the data layer to analyze the key characteristic parameters of the human digital model and obtain the real-time locomotion mode of the human body.
The application layer encapsulates the data, algorithms, and results in the HDT model according to different fields and different users, to realize convenient invocation and on-demand access to services. The current and historical multimodal data obtained from the data layer and the output results of the intelligent identification model are used as the input of the human state monitoring system and locomotion mode identification. This realizes real-time and high-assurance monitoring of the human state and enhances the perception and acquisition of the human locomotion mode by machines, which makes the HDT model reliable, stable, and interpretable, and improves the level of collaborative tasks between humans and CPS.

Data Acquisition System and Preprocessing
To enable the HDT to perceive and update the human body state and physiological data in real time, and to accurately identify the human locomotion mode, a human body data acquisition and preprocessing system is set up to obtain the multi-source heterogeneous data of humans.

Data Acquisition System
The system consists of an RGB-D camera, IMU sensors, EMG sensors, and plantar pressure sensors. An RGB-D camera (Kinect V2, USA) is used to obtain data on the overall position of human skeleton nodes. IMU sensors (Xsens Dot, USA) are installed on the arms, legs, and other locations of the human body to measure 3-axis acceleration data. The EMG sensors (Delsys, USA) are placed at the center of the muscles of the arms, legs, and other parts to measure the changes in muscle signal intensity during human activities. A pair of plantar pressure sensors (Tekscan, USA) is attached inside the shoes to obtain data on the plantar pressure distribution of the human body and gait state information during activities. The RGB-D camera communicates with the computer via USB. The IMU sensors, EMG sensors, and plantar pressure sensors are wirelessly connected to the computer terminal through Bluetooth and Wi-Fi, and the data is transmitted to the computer terminal in real time and then processed.

Data Processing
(1) Normalization
The multi-source heterogeneous data of the human body obtained by the data acquisition system has different units, and its values may be too large or too small. If the data is used directly to build an HDT model, fuse multimodal data, and train a neural network model, it will lead to large deviations, which in turn cause problems such as poor training performance and slow convergence. Therefore, it is necessary to normalize the data from different sensors. The normalization formula is as follows:
$$x'_{i,j} = 2\,\frac{x_{i,j} - (x_j)_{\min}}{(x_j)_{\max} - (x_j)_{\min}} - 1, \tag{1}$$
where $x_{i,j}$ is the i-th element in the sampling data of the j-th channel; $x'_{i,j}$ represents the normalized data of $x_{i,j}$; $(x_j)_{\max}$ and $(x_j)_{\min}$ denote the maximum and minimum values in the sampling data of the j-th channel. After normalization, the maximum and minimum of $x'_{i,j}$ are 1 and −1 respectively.
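As an illustration of the normalization step, the following minimal Python sketch scales each channel to the range [−1, 1] using its own minimum and maximum, which is the mapping implied by the stated range of the normalized data. The function names and the handling of a constant channel (mapped to 0.0) are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Sequence


def normalize_channel(samples: Sequence[float]) -> List[float]:
    """Min-max scale one sensor channel to the range [-1, 1]."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        # Assumption: a constant channel carries no information and the
        # formula would divide by zero, so it is mapped to 0.0.
        return [0.0 for _ in samples]
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in samples]


def normalize_channels(data: Sequence[Sequence[float]]) -> List[List[float]]:
    """Normalize every channel independently; data[j] is channel j."""
    return [normalize_channel(channel) for channel in data]
```

Each channel is normalized independently, so signals with very different units (for example acceleration and EMG amplitude) end up on a common scale before fusion.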
(2) Data segmentation
The sliding window method is one of the main methods used to segment time series data and helps LSTM exploit how the data changes over time. It cuts the time series data according to a certain step size and divides the original complete data set into data subsets of the window size. Furthermore, it gives different data subsets a certain correlation, so that the data can be fed into the LSTM network structure for better training. Assuming that D represents the initial dataset, it can be expressed as in Eq. (2), where $X_i^j$, $Y_i^j$, and $Z_i^j$ respectively denote the data at the i-th moment under the j-th channel of the X, Y, and Z sensor, and n represents the time length of the initial dataset.
In this paper, the size of the sliding window is len, and the step size between windows is st. The k-th window is calculated according to Eq. (3).
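The segmentation can be illustrated with a short Python sketch. It assumes that windows start at index 0, advance by st samples, and that a trailing window shorter than len is discarded; these choices, and the function name `sliding_windows`, are assumptions for illustration rather than details given in the paper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(series: Sequence[T], length: int, step: int) -> List[List[T]]:
    """Cut a time series into windows of `length` samples, advancing by `step`."""
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    windows = []
    start = 0
    # Keep only complete windows (assumption: the incomplete tail is dropped).
    while start + length <= len(series):
        windows.append(list(series[start:start + length]))
        start += step
    return windows
```

With a step smaller than the window length, consecutive windows overlap, which is what gives neighbouring data subsets the correlation described above. For a series of n samples this yields floor((n − len) / st) + 1 windows when n ≥ len.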

Human Digital Model Construction
By analyzing the human body state and physiological signal data obtained and preprocessed by the data acquisition system, the key feature information is extracted as the human locomotion feature signals, which comprehensively reflect the locomotion state of the human body and are used to construct a human digital model.
The construction of the human digital model aims to use a variety of intelligent sensors to obtain rich multimodal information about the human body, which can express the human body in digital form during the manufacturing process. The construction of the human digital model is shown in Eq. (4):
$$S = \{S_I, S_F, S_R, S_E\}, \tag{4}$$
where S is the data set of the human digital model under the multi-source sensors. In this paper, S mainly includes the IMU acceleration signal dataset $S_I$, the plantar pressure signal dataset $S_F$, the RGB-D camera dataset $S_R$, and the EMG signal dataset $S_E$.
The IMU sensors are used to model the acceleration and angular velocity information of the human body and are composed of 3-axis accelerometers and 3-axis gyroscopes.
The plantar pressure sensors model the plantar pressure distribution and gait information of the human body. The plantar pressure distribution map of the human body can be obtained by the pressure measurement system F-scan equipped with plantar pressure sensors. The obtained plantar pressure distribution map is divided into front and rear parts, and the average values of the front and rear sole pressure are taken as the pressure values of the toe and heel, in order to find the two events of heel-strike and toe-off. The digital model of human plantar pressure is shown in Eqs. (7)-(9), where $F_h$ denotes the average distribution of heel pressure and is used to record the event of heel-strike during human activities, and $F_t$ is the average distribution of toe pressure and is used to record the event of toe-off during human activities.
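The front/rear averaging and the two gait events can be sketched in Python as follows. The layout of the pressure map (rows ordered from toe to heel), the even split into two halves, and the threshold-crossing rule for detecting events are all assumptions made for illustration; the paper does not specify them.

```python
from typing import List, Sequence, Tuple


def heel_toe_pressure(pressure_map: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (F_h, F_t): mean pressure of the rear and front halves of the map.

    Assumption: row 0 is at the toe end and the last row is at the heel end.
    """
    if len(pressure_map) < 2:
        raise ValueError("the pressure map needs at least two rows")
    mid = len(pressure_map) // 2
    front = [v for row in pressure_map[:mid] for v in row]
    rear = [v for row in pressure_map[mid:] for v in row]
    return sum(rear) / len(rear), sum(front) / len(front)


def detect_events(f_h: Sequence[float], f_t: Sequence[float],
                  threshold: float) -> List[Tuple[int, str]]:
    """Mark heel-strike when F_h rises through the threshold and
    toe-off when F_t falls below it (hypothetical detection rule)."""
    events = []
    for k in range(1, min(len(f_h), len(f_t))):
        if f_h[k - 1] < threshold <= f_h[k]:
            events.append((k, "heel-strike"))
        if f_t[k - 1] >= threshold > f_t[k]:
            events.append((k, "toe-off"))
    return events
```

In practice the threshold would need to be tuned to the sensor and the wearer; a fixed value is used here only to keep the sketch short.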
The EMG sensors model the state of the human muscle groups. EMG signals can effectively reflect the muscle activation level of the human body and have a high correlation with the muscle contraction force [29]. Therefore, EMG signals are applied to identify human locomotion intention in this study. Common muscle groups used for human lower extremity modeling are the gluteus medius, right external oblique, semitendinosus, gracilis, biceps femoris, rectus femoris, vastus lateralis, vastus medialis, soleus, tibialis anterior, and gastrocnemius medialis [30]. The digital model of the human EMG signals is shown in Eq. (10), where $E_i$ is the activation of the i-th muscle group.
The final component of the human digital model is the position information of the human skeleton nodes obtained by the depth camera. In this paper, the depth camera used is the Kinect v2 camera, which is composed of an RGB camera, an infrared camera, and an infrared projector. The infrared camera and the infrared projector constitute a 3D structured light depth sensor, and computer graphics technology can then be used to obtain the skeleton position information of the human body. The digital model of the human skeleton position is shown in Eqs. (11) and (12):
$$S_R = \{p_1, p_2, \ldots, p_n\}, \tag{11}$$
$$p_i = (x_i, y_i, z_i), \tag{12}$$
where $p_i$ represents the position information of the i-th human skeleton node; n is the total number of collected human skeleton positions; $x_i$, $y_i$, and $z_i$ respectively represent the coordinate values of the human skeleton node in Cartesian coordinates.
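One way to hold the four components of the human digital model at a single time step is sketched below in Python. The class name `HumanDigitalModelFrame`, its field names, and the ordering of the flattened feature vector are hypothetical and chosen only for illustration; the paper does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class HumanDigitalModelFrame:
    """One time step of the multimodal data set S (hypothetical container)."""
    imu_acc: List[Vec3]            # S_I: 3-axis acceleration per IMU
    plantar: Tuple[float, float]   # S_F: (F_h, F_t) heel and toe pressure
    emg: List[float]               # S_E: activation E_i per muscle group
    skeleton: List[Vec3]           # S_R: p_i = (x_i, y_i, z_i) per node

    def feature_vector(self) -> List[float]:
        """Flatten all modalities into one vector (assumed order:
        acceleration, plantar pressure, EMG, skeleton positions)."""
        features: List[float] = []
        for acc in self.imu_acc:
            features.extend(acc)
        features.extend(self.plantar)
        features.extend(self.emg)
        for node in self.skeleton:
            features.extend(node)
        return features
```

A sequence of such frames, after normalization and windowing, would form the time series passed to the identification network.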

Multimodal Data Fusion and Locomotion Mode Identification Using BiLSTM-CNN
Based on the proposed construction method of the human digital model, we design a neural network architecture composed of BiLSTM and CNN for multimodal sensor data fusion and human locomotion mode identification.
For the identification of human locomotion modes, the multimodal data mainly consists of acceleration signals, plantar pressure signals, electromyographic signals, and skeletal joint position signals. Among these, acceleration signals are selected from the right thigh and shank. Plantar pressure signals are selected from the right foot. Electromyographic signals are selected from the tibialis anterior muscle and the rectus femoris muscle. Skeletal joint position signals are selected from the anterior superior iliac spines.

BiLSTM
LSTM is a network structure improved from the recurrent neural network (RNN) structure, so LSTM can effectively deal with temporally changing data. LSTM has the advantage of overcoming the problems of gradient disappearance and explosion in RNN [31]. Thus, it has been widely used in speech recognition, text translation, and state prediction.
LSTM is composed of a cell unit and three gated units: an input gate, a forget gate, and an output gate. σ is the sigmoid function, which is used to place the output range of the gate units at [0,1]. tanh is an activation function that controls the output range between −1 and 1. The input of the network at time t is the multimodal feature vector composed of the acceleration, plantar pressure, EMG, and skeletal joint position signals:
$$x_t = \{I^{t}_{ax}, I^{t}_{ay}, I^{t}_{az}, I^{s}_{ax}, I^{s}_{ay}, I^{s}_{az}, F_h, F_t, E_t, E_r, p_l, p_r\}. \tag{13}$$
The function of the forget gate is to selectively forget the cell state $C_{t-1}$ passed from the previous node. The calculation formula of the forget gate unit is as follows:
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \tag{14}$$
where $W_f$ and $U_f$ denote the weight coefficient matrices of the forget gate unit; $b_f$ is the bias parameter of the forget gate unit; $h_{t-1}$ is the hidden state output at the previous moment; $C_{t-1}$ represents the state information of the previous cell.
The input gate selectively memorizes the input, retaining useful information and reducing the memory of useless information, to determine how much information to put into the current state. The calculation of the input gate is divided into two steps. The first step uses the sigmoid function to obtain the corresponding $i_t$. The second step generates the candidate data for updating, which is obtained by using the tanh function. The calculation formulas for the input gate unit are as follows:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \tag{15}$$
$$\tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \tag{16}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \tag{17}$$
where $W_i$, $W_c$, $U_i$, and $U_c$ represent the weight coefficient matrices of the input gate unit; $b_i$ and $b_c$ represent the bias parameters of the input gate unit; $C_t$ is used to update the state information of the next cell.
The output gate decides which part of the information needs to be output according to the current cell state. First, the output gate computes $o_t$ to control the content of the output part. Then the tanh function is used to process the cell state, so that the calculation formulas of the output gate unit are obtained as follows:
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o), \tag{18}$$
$$h_t = o_t \odot \tanh(C_t), \tag{19}$$
where $W_o$ and $U_o$ denote the weight coefficient matrices of the output gate unit; $b_o$ represents the bias parameter of the output gate unit.
When training LSTM, the time series data is arranged in chronological order and then input into the network. Therefore, LSTM only considers the forward direction of the time series data and ignores the backward direction. BiLSTM adds backward learning to the original LSTM model and obtains information from the reversed sequence, which forms a two-way learning framework. In practical application scenarios, identification and prediction often involve the information of the entire input sequence, so BiLSTM can overcome the one-way limitation of LSTM. Therefore, BiLSTM is adopted to fuse and model the multimodal data in the HDT model.
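The gate computations and the bidirectional pass can be made concrete with a small pure-Python sketch of the standard LSTM formulation. It assumes that the W matrices act on the input x_t and the U matrices act on the previous hidden state, and that the forward and backward outputs are concatenated at each time step; the parameter dictionary layout and all function names are hypothetical, and this is not the authors' implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def affine(W, x: Sequence[float], U, h: Sequence[float], b: Sequence[float]) -> Vector:
    """Compute W x + U h + b, one output row at a time."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector, p: Dict) -> Tuple[Vector, Vector]:
    """One LSTM time step: forget, input, candidate, output, then state update."""
    f = [sigmoid(z) for z in affine(p["W_f"], x, p["U_f"], h_prev, p["b_f"])]
    i = [sigmoid(z) for z in affine(p["W_i"], x, p["U_i"], h_prev, p["b_i"])]
    c_tilde = [math.tanh(z) for z in affine(p["W_c"], x, p["U_c"], h_prev, p["b_c"])]
    o = [sigmoid(z) for z in affine(p["W_o"], x, p["U_o"], h_prev, p["b_o"])]
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def run_lstm(xs: Sequence[Vector], p: Dict, hidden: int) -> List[Vector]:
    """Run the cell over a sequence, starting from zero states."""
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs


def run_bilstm(xs: Sequence[Vector], p_fwd: Dict, p_bwd: Dict, hidden: int) -> List[Vector]:
    """Concatenate forward outputs with time-aligned backward outputs."""
    forward = run_lstm(xs, p_fwd, hidden)
    backward = run_lstm(list(reversed(xs)), p_bwd, hidden)[::-1]
    return [f + b for f, b in zip(forward, backward)]
```

The backward pass uses its own parameters and reads the window from the last sample to the first, so each output step combines context from both before and after that moment.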

CNN
CNN is a kind of feedforward neural network, which is composed of convolutional layers, pooling layers, and a fully connected layer.
Convolutional layers and pooling layers are the most important differences between CNN and traditional deep neural networks. The convolutional layer is composed of a convolution kernel and an activation function. The local features of the input data are extracted by the convolution kernel. Through the parameter-sharing mechanism, the number of training parameters of the network is greatly reduced and the training efficiency is improved. The pooling layer is located after the convolutional layer. By extracting the main features of a certain area, it reduces the feature map and the number of parameters to prevent the model from overfitting. Currently, the commonly used pooling methods are max pooling and average pooling. Since locomotion mode identification mainly requires extracting the salient feature information of the data while ignoring the interference of useless information, this paper chooses the max pooling method. The fully connected layer maps the features extracted by the convolutional and pooling layers to the data classification labels in the sample space. By connecting all neurons in the fully connected layer with the neurons in the previous layer, all local features are combined into global features. CNN can extract the features in the data more efficiently, thereby improving the accuracy of locomotion mode identification. The HDT model therefore selects CNN to extract feature information in human locomotion mode identification.
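The two operations can be illustrated on a one-dimensional signal with a few lines of Python. The sketch uses no padding and, as is conventional in deep learning frameworks, computes the convolution as a sliding dot product without flipping the kernel; both choices and the function names are assumptions for illustration.

```python
from typing import List, Sequence


def conv1d_valid(signal: Sequence[float], kernel: Sequence[float],
                 stride: int = 1) -> List[float]:
    """Slide the kernel over the signal (no padding) and take dot products."""
    k = len(kernel)
    return [
        sum(signal[start + j] * kernel[j] for j in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]


def max_pool1d(signal: Sequence[float], size: int, stride: int) -> List[float]:
    """Keep the maximum of each pooling window (no padding)."""
    return [
        max(signal[start:start + size])
        for start in range(0, len(signal) - size + 1, stride)
    ]
```

The same kernel weights are reused at every position, which is the parameter sharing mentioned above, and max pooling keeps only the strongest response in each neighbourhood while shortening the feature map.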

Proposed Neural Network
This paper comprehensively considers the characteristics of BiLSTM and CNN. BiLSTM can process and model the input time series data from a bidirectional perspective. CNN can quickly and efficiently extract the features of the data. Therefore, a BiLSTM-CNN network structure is proposed. The network structure is shown in Figure 2.
The BiLSTM-CNN model consists of the raw data layer, preprocess layer, BiLSTM layer, CNN layer, and output layer. The BiLSTM layer is formed by stacking two BiLSTM units and one LSTM unit, in which the number of BiLSTM units and LSTM units is 128. The CNN layer includes two convolutional layers, two pooling layers, and one fully connected layer. The two convolutional layers are conv 1 and conv 2. The number of convolution kernels of conv 1 is 128, and the size and stride of its convolution kernel are [1,3] and [1,1] respectively. The number of convolution kernels of conv 2 is 64, and the size and stride of its convolution kernel are [1,3] and [1,1] respectively. Both pooling layers use max pooling, with a pooling size of [1,3] and a stride of 2. The second pooling layer is connected to a fully connected layer. Finally, the results for the six locomotion modes are output by the Softmax layer connected to the fully connected layer. The sensitivity analysis of the hyperparameters of the proposed network is described in Section 4.
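To see how the stated kernel sizes and strides shrink the feature map, the short Python sketch below traces the width along the convolved axis through the four CNN stages. It assumes no padding in any layer and, purely as an example, an input width of 128; the paper states neither the padding mode nor the tensor shape entering conv 1, so the resulting numbers are illustrative only.

```python
from typing import List, Tuple


def valid_output_length(n: int, kernel: int, stride: int) -> int:
    """Output length of a convolution or pooling layer without padding."""
    if n < kernel:
        raise ValueError("input is shorter than the kernel")
    return (n - kernel) // stride + 1


def trace_cnn_widths(input_width: int) -> List[Tuple[str, int, int]]:
    """Return (layer name, output width, channels) for each CNN stage."""
    # (name, kernel width, stride, number of feature maps) from the text above.
    layers = [
        ("conv1", 3, 1, 128),
        ("pool1", 3, 2, 128),
        ("conv2", 3, 1, 64),
        ("pool2", 3, 2, 64),
    ]
    width = input_width
    trace = []
    for name, kernel, stride, channels in layers:
        width = valid_output_length(width, kernel, stride)
        trace.append((name, width, channels))
    return trace
```

Under these assumptions, the product of the last width and channel count gives the number of features entering the fully connected layer; with "same" padding the widths would differ.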
The optimizer used in the training process of the BiLSTM-CNN network model is the Adam optimizer, where the learning rate is set to 0.001. The loss function of the network model is cross-entropy, and its calculation formula is as follows:
$$L = -\sum_{i} y_i' \log(y_i), \tag{20}$$
where $y_i$ is the vector of the locomotion mode classification result after Softmax output; $y_i'$ represents the vector of the actual locomotion mode label. Both $y_i$ and $y_i'$ are vectors over the six locomotion modes, with the label encoded according to one-hot.
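A minimal Python sketch of the Softmax output and the cross-entropy loss for a single sample is given below. The clipping constant `eps`, used to avoid taking the logarithm of zero, and the function names are assumptions added for this example.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(predicted: Sequence[float], one_hot: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Cross-entropy between a predicted distribution and a one-hot label."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(predicted, one_hot))
```

Because the label is one-hot, only the predicted probability of the true locomotion mode contributes to the loss, so the loss approaches zero as that probability approaches 1.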
The number of training epochs of this network model is 100. The batch size is 50, and the step length is 128. To prevent overfitting during model training, L2 regularization is added to the loss function. The data acquisition system collected a total of 9000 samples covering six locomotion modes, namely walking, standing, squatting, going upstairs, going downstairs, and fast walking, with 1500 samples for each locomotion mode. To ensure the training effect of the network model, the data set is divided into two parts, of which 75% is used as the training set and 25% is used as the testing set.
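The 75%/25% division can be sketched in Python as below. The sketch assumes a stratified split, in which each locomotion mode is divided separately so that both sets keep the same class balance; the paper does not state whether the split was stratified, and the function name and fixed random seed are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(labels: Sequence[Hashable], train_fraction: float = 0.75,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train indices, test indices), splitting each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = int(round(len(indices) * train_fraction))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test
```

With 1500 samples per mode, this yields 1125 training and 375 testing samples for each of the six modes, that is, 6750 and 2250 samples in total.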

Case Study
To verify the effectiveness of the proposed HDT model framework in human locomotion mode identification, the accuracy and F1 score are used as the evaluation indicators of the performance of the BiLSTM-CNN model. The Accuracy evaluates the proportion of correctly identified data to the total data in the locomotion mode identification data set, and its formula is as follows:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \tag{21}$$
where TP, TN, FP, and FN respectively denote the number of true positives, true negatives, false positives, and false negatives in the process of classifying sample data.
The F1 score is calculated from precision and recall. The formulas for calculating Precision and Recall are as follows:
$$Precision = \frac{TP}{TP + FP}, \tag{22}$$
$$Recall = \frac{TP}{TP + FN}. \tag{23}$$
Targeting only precision or recall has limitations and can lead to extremes in model training, and the accuracy can also be affected by imbalanced samples. Therefore, to take both precision and recall into account while avoiding the impact of sample imbalance, the F1 score needs to be used [32][33][34]. The F1 score is the harmonic mean of precision and recall. Its calculation formula is as follows:
$$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}. \tag{24}$$
The experimental results are shown in Figure 5. According to the change in accuracy in Figure 5, when the number of hidden units is the same and the number of BiLSTM layers is 2, the recognition accuracy of the corresponding model is higher. Moreover, when the number of BiLSTM layers is set to 2 and the number of hidden units is 32, the recognition accuracy of the model is the highest.
In the training process of the BiLSTM-CNN model for human locomotion mode identification, the learning rate is closely related to the training duration and convergence speed. On the one hand, when the learning rate is low, the overall update and convergence speed of the model training process is slow. On the other hand, when the learning rate is high, oscillations occur during the model training process and the model fails to converge to the optimal solution. Therefore, an appropriate learning rate is crucial to the training of the model. According to the data in Table 1, the accuracy of human locomotion mode identification is the highest when the learning rate is 0.001. Therefore, the learning rate during BiLSTM-CNN model training is set to 0.001.
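The evaluation indicators defined above (accuracy, precision, recall, and F1 score) can be computed from a multi-class confusion matrix with the following Python sketch. It assumes that rows index the true locomotion mode and columns index the predicted mode, and it treats each mode one-versus-rest when counting TP, FP, and FN; the macro-averaged F1 is one common way of combining the per-class scores and is an assumption here, since the paper does not state how the per-class values are aggregated.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[int]]


def accuracy(cm: Matrix) -> float:
    """Fraction of samples on the diagonal (correctly identified)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total


def precision_recall_f1(cm: Matrix, k: int) -> Tuple[float, float, float]:
    """One-versus-rest precision, recall and F1 for class k."""
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm))) - tp  # predicted k, true other
    fn = sum(cm[k]) - tp                             # true k, predicted other
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(cm: Matrix) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores: List[float] = [precision_recall_f1(cm, k)[2] for k in range(len(cm))]
    return sum(scores) / len(scores)
```

For a six-mode confusion matrix, `precision_recall_f1(cm, k)` can be called for each mode to see which modes are confused with one another, while `accuracy(cm)` gives the single overall figure.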
To demonstrate the superiority of the proposed BiLSTM-CNN model in human locomotion mode identification, it is compared with several existing human locomotion mode identification models, including LSTM and CNN. In the comparison, the LSTM and CNN models have the same structure and hyperparameters. The confusion matrices of the three models in human locomotion mode identification are shown in Figure 7. Among them, the proposed BiLSTM-CNN model has the highest accuracy, about 99.82%, while the identification accuracies of the LSTM and CNN models are both below 96%. We also compared the F1 scores of the models and found that the F1 score of the BiLSTM-CNN model exceeds 99%, which is significantly better than those of the other two models. Therefore, under the framework of the HDT model proposed in this paper, the proposed BiLSTM-CNN model achieves good recognition accuracy in human locomotion mode identification.

Conclusions
With the widespread implementation of human-centered intelligent manufacturing, improving the efficiency of collaborative work between humans and manufacturing systems has become an inevitable requirement. To address this challenge, this paper proposes a framework for building a human digital twin model to enhance the in-depth integration of humans and CPS. In summary, the main contributions of this paper are as follows:

(1) A framework for modeling an HDT based on multimodal data is proposed. To realize the dynamic update and real-time analysis of the HDT, the framework integrates key technologies for the construction of the HDT. The data acquisition system and preprocessing technology realize the dynamic perception of the human body. The human digital model construction technology facilitates the digital representation of complex human features.

(2) A multimodal data fusion method based on BiLSTM-CNN is devised to realize real-time monitoring of the human body and locomotion mode identification.

(3) Experiments on hyperparameter optimization of the BiLSTM-CNN network are conducted, and the proposed model is compared with traditional locomotion mode identification models. The results indicate that the proposed model has an accuracy rate of about 99.83% for six human locomotion modes.
The HDT model building framework based on multimodal data can achieve high-quality modeling of the complex human body and help realize dynamic mapping between human physical and virtual entities. The framework obtains good results in human motion pattern recognition experiments and outperforms traditional locomotion mode identification algorithms.
The HDT model is constructed to better serve the HCPS and improve the automation level of manufacturing. However, the current HDT model framework focuses only on the identification of human locomotion modes in manufacturing scenarios and does not establish a connection between the results of the HDT model and the intelligent control of the machine. In the next step, we will focus on combining the constructed HDT model with the CPS so that the machine can perceive changes in the human body state and physiological data, and thereby promote the adaptive control of HCPS.
The data layer includes real-time human state data, historical data, and the intelligent identification model based on deep neural networks. The real-time state data of the human body consist of IMU signals, depth camera signals, plantar pressure signals, and EMG signals. By integrating these real-time state data with historical data, human body information is obtained from both current and historical perspectives, which ensures comprehensive and accurate perception of the human body. The intelligent identification model based on a deep neural network is composed of BiLSTM and CNN. The data layer is the theoretical basis of the digital twin model of the human body. The real-time multimodal data measured by a variety of intelligent sensors are obtained from the physical entity layer as the input of the data layer. All data are transferred through the data transfer module, which includes Bluetooth, 5G, etc. These key technologies greatly shorten the data transmission time and guarantee efficient transmission of information between the physical entity layer, cyber model layer, and application layer. The intelligent identification model based on the deep neural network can identify the current locomotion mode of the human body according to the human digital model in the cyber model layer, and provides theoretical support for the application layer.

Figure 1 HDT model framework based on multimodal data

$S = \{S_I, S_F, S_E, S_R\}$,

gyroscopes. The accelerometers can detect the acceleration signals of the x, y, and z axes at the installation position in the carrier coordinate system. Using these signals, the posture information of the current installation position on the human body can be obtained. After processing the obtained IMU signals, the acceleration and angular velocity signals of the human body at the corresponding installation position can be obtained, as shown in Eqs. (5) and (6), where $I_k$ is the data of the k-th IMU sensor; $m$ is the number of IMU sensors; $I^{ax}_k$, $I^{ay}_k$, and $I^{az}_k$ represent the acceleration signals of the k-th IMU sensor in the x, y, and z axes; and $I^{gx}_k$, $I^{gy}_k$, and $I^{gz}_k$ represent the angular velocity signals of the k-th IMU sensor in the x, y, and z axes.

(ASISs). In the expression for the multimodal input data, $I^{ax}_t$, $I^{ay}_t$, and $I^{az}_t$ represent the acceleration signals at the thigh; $I^{ax}_s$, $I^{ay}_s$, and $I^{az}_s$ represent the acceleration signals at the shank; $E_t$ and $E_r$ denote the electromyographic signals at the tibialis anterior muscle and the rectus femoris muscle, respectively; and $p_l$ and $p_r$ are the skeletal joint position signals at the anterior superior iliac spines.
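The multimodal input described here can be sketched as a per-time-step feature vector built from the thigh and shank accelerations, the two EMG channels, and the two ASIS position signals. This is a minimal sketch under stated assumptions: the `Frame` layout, the reading of $p_l$ and $p_r$ as left and right 3D positions, and the sliding-window segmentation are illustrative choices that are not specified in the text, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Frame:
    """One time-aligned multimodal sample (hypothetical layout)."""
    acc_thigh: Vec3    # I^ax_t, I^ay_t, I^az_t
    acc_shank: Vec3    # I^ax_s, I^ay_s, I^az_s
    emg_ta: float      # E_t, tibialis anterior
    emg_rf: float      # E_r, rectus femoris
    asis_left: Vec3    # p_l, assumed here to be a 3D position
    asis_right: Vec3   # p_r, assumed here to be a 3D position


def to_vector(frame: Frame) -> List[float]:
    """Concatenate all modalities of one frame into a flat feature vector."""
    return [*frame.acc_thigh, *frame.acc_shank,
            frame.emg_ta, frame.emg_rf,
            *frame.asis_left, *frame.asis_right]


def sliding_windows(frames: Sequence[Frame], length: int,
                    step: int) -> List[List[List[float]]]:
    """Cut a frame sequence into fixed-length windows (assumed segmentation)."""
    windows = []
    for start in range(0, len(frames) - length + 1, step):
        windows.append([to_vector(f) for f in frames[start:start + length]])
    return windows


# Made-up samples for illustration; not measurements from this paper.
frames = [
    Frame((0.0, 0.0, 9.8), (0.1, 0.0, 9.7), 0.02 * i, 0.01 * i,
          (0.1, 0.9, 0.0), (-0.1, 0.9, 0.0))
    for i in range(10)
]
windows = sliding_windows(frames, length=4, step=2)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

Each window is a (time steps × features) array, the kind of sequence input a recurrent layer such as BiLSTM consumes. In practice the signals would first need to be resampled to a common rate, which this sketch takes for granted.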

Figure 2 The structure of BiLSTM-CNN network

Figure 3 Data acquisition system

In this section, we first discuss the influence of the hyperparameters of the BiLSTM-CNN model on its locomotion mode identification performance. The number of BiLSTM layers, the number of hidden units, the convolution kernel size, and the learning rate of the BiLSTM-CNN model are optimized. Then, the BiLSTM-CNN model proposed in this paper is compared with conventional locomotion mode identification models, including LSTM and CNN. Finally, the superiority of the BiLSTM-CNN model in locomotion mode identification accuracy is demonstrated. The training process of the BiLSTM-CNN model is shown in Figure 4.

To study the influence of the number of BiLSTM layers and the number of hidden units on the proposed human locomotion mode identification model, the number of BiLSTM layers is set to 1, 2, and 3, and the number of hidden units of BiLSTM is set to 16, 24, 32, and 64. The remaining parameters of the BiLSTM-CNN model are kept the same. Comparative experiments are carried out by adjusting the number of BiLSTM layers and the number of hidden units.
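The comparative experiment over BiLSTM layers and hidden units amounts to a grid search over 3 × 4 = 12 configurations with all other parameters fixed. The sketch below only shows the search loop. The `evaluate` callback stands for training and testing one configuration and is hypothetical, and `placeholder_evaluate` returns a synthetic score constructed to peak at the configuration the text reports. It is not measured accuracy and does not reproduce Figure 5.

```python
from itertools import product
from typing import Callable, Dict, Tuple

NUM_LAYERS = (1, 2, 3)            # candidate BiLSTM layer counts from the text
HIDDEN_UNITS = (16, 24, 32, 64)   # candidate hidden-unit counts from the text

Config = Tuple[int, int]          # (number of layers, number of hidden units)


def grid_search(
    evaluate: Callable[[int, int], float],
) -> Tuple[Config, Dict[Config, float]]:
    """Score every (layers, hidden units) pair and return the best one.

    `evaluate` is a hypothetical callback that trains a model with the given
    configuration (all other parameters fixed) and returns its test accuracy.
    """
    scores = {
        (layers, hidden): evaluate(layers, hidden)
        for layers, hidden in product(NUM_LAYERS, HIDDEN_UNITS)
    }
    best = max(scores, key=lambda cfg: scores[cfg])
    return best, scores


def placeholder_evaluate(num_layers: int, hidden_units: int) -> float:
    # Synthetic stand-in so the loop can run without training anything.
    # NOT real accuracy: it is built to peak at (2, 32) for illustration only.
    return 1.0 - 0.01 * abs(num_layers - 2) - 0.0001 * abs(hidden_units - 32)


best, scores = grid_search(placeholder_evaluate)
print(f"evaluated {len(scores)} configurations, best = {best}")
```

Replacing `placeholder_evaluate` with a function that actually trains and evaluates the network would turn this into the comparison described above. The same loop extends to the convolution kernel size and learning rate by adding further candidate tuples to `product`.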

Figure 4 Convergence change of test loss during model training

Figure 5

Figure 6 The influence of convolution kernel size on recognition accuracy

Figure 7 The confusion matrix of three different models in human locomotion mode identification

Table 1 The influence of learning rate on recognition accuracy