Research on typical characteristics and development direction of medical data architecture



The medical and health industry is developing rapidly and is at a key stage of being empowered by the Internet. Because medical data is highly privacy-sensitive, public medical and health data is hard to obtain for research through traditional channels. A practical alternative is to base the analysis on the competition designs of the Alibaba Cloud Tianchi platform and the desensitized data sets they provide. This article considers the workflow of hospital information systems, analyzes the characteristics of medical and health data using these open data sets, and from that derives the likely development direction of data architecture in the medical and health industry.

Medical and health data characteristics

First, consider two recent Tianchi competitions, both built around medical data research and mining: the cervical cancer risk intelligent diagnosis competition and the ECG human-machine intelligence competition. Both use desensitized data drawn from actual cases, so their reference value is high.

Analyzing the data sets provided by the two competitions makes one characteristic of medical data clear: heterogeneity. Because of the nature of medical examination methods, image data makes up a relatively high proportion. However, because training also has to take into account other patient characteristics such as gender, age, height and weight, the data sets also contain some structured data. A medical data set is therefore a typical heterogeneous data set that mixes unstructured and structured data.

Analysis of common prediction algorithms

The predictions required from medical data are generally classifications. Because the main purpose of a result is not to deliver a definitive conclusion directly but to give doctors a reference, both binary classification (yes or no) and multi-class classification (several categories) have practical value.

In the cervical cancer risk intelligent diagnosis competition, the preliminary-round malignant cell detection task is a binary classification problem, while the second-round task, malignant cell detection and classification, is a multi-class problem: detections must be assigned to one of five typical cervical cancer categories.

In terms of data processing, the training images have to be combined with the doctor’s manual annotations and the patient’s feature information, so deep learning is the natural choice. Because a single CT image and its annotations belong to exactly one patient, a JSON file is a very suitable record format: compared with a structured table, one JSON file per image file records the data more naturally.
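The one-image-one-JSON idea can be sketched as follows. This is only an illustration: the field names (`image`, `patient`, `annotations`, the box coordinates) and the file name are assumptions made for the example, not the actual schema used in the competition.

```python
import json
from pathlib import Path


def build_record(image_name, patient, annotations):
    """Bundle one image's patient features and doctor annotations.

    All keys are hypothetical; a real data set defines its own schema.
    """
    return {
        "image": image_name,
        "patient": patient,          # structured features, e.g. age, gender
        "annotations": annotations,  # regions marked by the doctor
    }


def write_sidecar(image_path, record):
    """Write the record next to the image as <image stem>.json."""
    json_path = Path(image_path).with_suffix(".json")
    json_path.write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return json_path


def read_sidecar(json_path):
    """Load a sidecar record back into a dictionary."""
    return json.loads(Path(json_path).read_text(encoding="utf-8"))
```

Keeping the record beside the image means a training pipeline can pair the two by file name alone, without a separate index table.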

In terms of data size, the data set consists of thousands of cervical cancer cytology images with the corresponding location annotations of abnormal squamous epithelial cells. Each image is acquired with a 20x digital scanner and is 300-400 MB in size. A training set of 800 images is therefore about 273 GB, and unstructured data accounts for the majority.
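The size estimate above can be reproduced with a quick calculation. The sketch assumes an average image size of 350 MB (the midpoint of the stated range) and binary units (1024 MB per GB); both are assumptions, since the article does not say how the figure was derived.

```python
def dataset_size_gb(num_images, avg_image_mb):
    """Total size in GB, assuming 1 GB = 1024 MB (binary units)."""
    return num_images * avg_image_mb / 1024


# 800 training images at the low end, midpoint and high end of 300-400 MB.
low = dataset_size_gb(800, 300)
mid = dataset_size_gb(800, 350)
high = dataset_size_gb(800, 400)
print(f"low={low:.1f} GB, mid={mid:.1f} GB, high={high:.1f} GB")
```

Under these assumptions the midpoint comes out at roughly 273 GB, which matches the figure quoted in the text.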

In the ECG human-machine intelligence competition, classifying abnormal ECG events is a multi-class problem: each detection must be assigned to one of the abnormal events in the training set. The data set contains 40,000 medical ECG samples. Each sample has 8 leads: I, II, V1, V2, V3, V4, V5 and V6. Each sample is recorded at a sampling frequency of 500 Hz for 10 seconds, with a unit voltage of 4.88 microvolts. The data is thus already structured when it leaves the detection equipment. Unlike CT images, it does not require deep learning for feature extraction and processing; conventional data preprocessing methods are sufficient.
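A minimal sketch of what this structure implies for preprocessing is shown below. It assumes that a sample is held as a dictionary mapping each lead name to a list of raw integer readings, and that one raw unit corresponds to 4.88 microvolts; the container format and the summary features are illustrative assumptions, not the competition's file format.

```python
SAMPLING_RATE_HZ = 500
DURATION_S = 10
LEADS = ("I", "II", "V1", "V2", "V3", "V4", "V5", "V6")
UNIT_MICROVOLTS = 4.88  # assumed meaning: one raw unit = 4.88 microvolts

# 500 readings per second for 10 seconds gives 5000 readings per lead.
SAMPLES_PER_LEAD = SAMPLING_RATE_HZ * DURATION_S


def to_millivolts(raw_values):
    """Convert raw device units to millivolts."""
    return [v * UNIT_MICROVOLTS / 1000.0 for v in raw_values]


def validate_sample(sample):
    """Check that a sample has exactly the 8 leads and 5000 readings each."""
    if set(sample) != set(LEADS):
        raise ValueError("unexpected set of leads")
    for lead, values in sample.items():
        if len(values) != SAMPLES_PER_LEAD:
            raise ValueError(f"lead {lead} has {len(values)} readings")


def lead_features(raw_values):
    """Simple per-lead summary features (hypothetical feature choice)."""
    mv = to_millivolts(raw_values)
    return {"min_mv": min(mv), "max_mv": max(mv), "mean_mv": sum(mv) / len(mv)}
```

Each sample is therefore a fixed-size 8 x 5000 table of numbers, which is why conventional preprocessing is enough.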

From the algorithm point of view, images require deep learning. The convolutional neural network (CNN) is the mainstream algorithm for image recognition. According to the publicly shared solutions of the two competitions, almost all solutions in the cervical cancer risk intelligent diagnosis competition are deep learning algorithms based on neural networks; they differ only in the deep learning framework used and in the specific network variant. This suggests that the data science community is converging on a common general direction for unstructured medical data. In the ECG human-machine intelligence competition, by contrast, the solutions are machine learning classification algorithms, and classifiers based on decision trees hold an absolutely dominant position. Algorithms derived from decision trees, such as random forest (RF), GBDT and LightGBM, account for the majority, with LightGBM the most commonly used.

From the perspective of tuning on the cross-validation set and evaluating on the test set, cancer-oriented algorithms and other algorithms, such as cardiac abnormality detection, need to focus on different things. A cancer detection result has a great psychological impact on patients and their families, so great attention must be paid to the balance between precision and recall, to prevent an overfitted algorithm from raising false alarms everywhere, which also increases the doctors’ review workload. Overfitting is arguably less serious for cardiac abnormality algorithms or other common biochemical indicator data: once the data volume is large enough, by the law of large numbers, even an overfitted model tends to become more accurate over time. For the judgment of cardiac abnormalities in particular, high accuracy is extremely important, because the data is highly time-sensitive and its value decreases rapidly over time. Even if overfitting causes a false alarm, it at least prompts patients or their families to pay attention.
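The trade-off between the two measures can be made concrete with a small sketch. It assumes binary labels where 1 means "abnormal"; the F-beta score is a standard way to weight recall against precision and is offered here only as one possible way to express the balance, not as the metric the competitions used.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = abnormal, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy example: 3 truly abnormal cases, 3 flagged by the model, 2 in common.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)
```

A cancer screening model might be tuned with a moderate beta to keep false alarms in check, while a cardiac alarm that tolerates false positives might use a larger beta to favor recall.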

Medical data processing architecture scheme

Based on the above analysis of medical data characteristics and data mining algorithms, this section examines an architecture for medical data processing.

The coexistence of structured and unstructured medical data calls for heterogeneous computing on both CPUs and GPUs. In a real hospital, unstructured data comes mainly from images generated by radiological examination equipment such as CT, at about 350 MB per image, while biochemical indicators, including ECG indicators, can be represented as structured data. Processing unstructured data needs a lot of GPU computing power, and it is unrealistic to require every hospital to expand its local IDC room and add a GPU cluster. From an architectural perspective, cloud-fog-edge collaboration is therefore an ideal architecture.

1 Edge computing node

The computing nodes located next to the various examination devices (including the PCs attached to the devices and those on which doctors view results) make up the edge computing nodes of the collaborative system. Under current technical conditions, however, edge nodes have relatively little computing power and cannot be expected to perform large-scale image recognition. The main tasks of an edge computing node are therefore data cleaning and transmission to the fog end. A hospital runs many kinds of examinations, and the data formats of the various reports and images are not uniform. Cleaning data at the edge in advance reduces the computing pressure on the fog and cloud ends and helps the hospital move toward unified data in the future.
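Edge-side cleaning of this kind can be sketched as a small normalization step. The field aliases, the target schema and the gender vocabulary below are all hypothetical; a real deployment would derive them from the hospital's own device formats.

```python
# Hypothetical mapping from device-specific field names to one schema.
FIELD_ALIASES = {
    "pid": "patient_id",
    "patient_id": "patient_id",
    "sex": "gender",
    "gender": "gender",
    "weight": "weight_kg",
    "weight_kg": "weight_kg",
}
GENDER_VALUES = {"m": "male", "male": "male", "f": "female", "female": "female"}


def clean_record(raw):
    """Normalize one device record before it is sent to the fog end."""
    cleaned = {}
    for key, value in raw.items():
        name = FIELD_ALIASES.get(key.strip().lower())
        if name is None or value in (None, ""):
            continue  # drop unknown or empty fields
        cleaned[name] = value
    if "patient_id" in cleaned:
        cleaned["patient_id"] = str(cleaned["patient_id"]).strip()
    if "gender" in cleaned:
        key = str(cleaned["gender"]).strip().lower()
        cleaned["gender"] = GENDER_VALUES.get(key, "unknown")
    if "weight_kg" in cleaned:
        cleaned["weight_kg"] = float(cleaned["weight_kg"])
    return cleaned
```

Because every edge node emits the same schema, the fog end can merge records from different devices without format-specific code.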

2 Fog computing node

The hospital’s existing local IDC room can be regarded as a fog computing node, which is particularly important for the medical industry at present. Although 5G meets the latency and throughput requirements of large-scale data transmission, the hospital environment is complex. If data from edge computing nodes were transmitted directly to the cloud, the network layer would rely heavily on wireless communication. Whether wireless communication, especially 5G high-frequency communication between edge nodes and the cloud, interferes with medical equipment or causes other unexpected problems still needs to be studied in practice. In the short term, the most appropriate approach is for edge computing nodes to send their data to the fog computing node over wired connections.

Fog computing nodes serve several practical functions, such as aggregating data from edge computing nodes, distinguishing application scenarios, and computing. In particular, if a hospital’s local IDC server cluster is powerful enough, structured data can be mined, trained on and used for prediction locally without being sent to the cloud. From a communication point of view, using the fog end as the single outlet for data sent to the cloud avoids possible interference of wireless signals with medical devices. While 5G is not yet widespread or remains costly, a dedicated line between the local IDC and the cloud can serve as a transitional solution.

For hospitals with multiple campuses, the local IDC rooms in different locations can serve as fog ends that provide remote disaster recovery for one another. If a single node fails, its workload can be migrated in time, ensuring business continuity and the availability and integrity of stored data.
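The scenario-distinguishing role of the fog node can be sketched as a simple routing rule. The `Job` type, its fields and the `local_capacity_ok` flag are hypothetical names introduced for illustration; the rule itself only restates the text: structured data stays local when the local cluster is strong enough, and everything else is forwarded to the cloud.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """One unit of work arriving at the fog node (hypothetical type)."""
    name: str
    structured: bool  # True for tabular or ECG-style data, False for images
    size_mb: float


def route(job, local_capacity_ok=True):
    """Return 'fog' to process in the local IDC room, else 'cloud'."""
    if job.structured and local_capacity_ok:
        return "fog"
    return "cloud"


def split_jobs(jobs, local_capacity_ok=True):
    """Group jobs by destination so the fog node can batch cloud uploads."""
    plan = {"fog": [], "cloud": []}
    for job in jobs:
        plan[route(job, local_capacity_ok)].append(job.name)
    return plan
```

Batching the cloud-bound jobs at the fog node also keeps the uplink to the cloud in one place, which is the single-outlet property the text describes.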

3 Cloud

The cloud computing platform addresses the reality that hospitals have a large demand for heterogeneous computing but cannot deploy a large-scale GPU cluster in a short time. The high-definition image files generated by CT and other radiological examination equipment, along with other data that requires deep learning, can be transmitted to the cloud through the fog end for computing. Because hospital demand for computing power fluctuates with the number of patients, the elastic scaling of cloud computing keeps the cost of heterogeneous computing as low as possible: the GPU cluster can automatically add computing nodes when demand is high and automatically reduce the number of virtual machines in the cluster when demand is low.
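The scaling behavior described above can be sketched as a target-size calculation. This is not the API of any particular cloud provider: the function name, its parameters and the default limits are assumptions chosen for the example.

```python
import math


def desired_gpu_nodes(pending_images, images_per_node_per_hour,
                      target_hours=1.0, min_nodes=0, max_nodes=16):
    """Number of GPU nodes needed to clear the backlog within target_hours.

    All parameters are hypothetical tuning knobs. The result is clamped
    to the range [min_nodes, max_nodes].
    """
    if images_per_node_per_hour <= 0 or target_hours <= 0:
        raise ValueError("throughput and target_hours must be positive")
    capacity_per_node = images_per_node_per_hour * target_hours
    needed = math.ceil(pending_images / capacity_per_node)
    return max(min_nodes, min(max_nodes, needed))
```

With an empty queue the cluster shrinks to `min_nodes`, and a sudden surge in patients grows it only up to `max_nodes`, which bounds the cost.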

Author: Zhu Qi

This is original content from the Yunqi community and may not be reproduced without permission.