How to extract value from messy medical big data (1)


Business Tags: hospital information integration platform, Internet hospital, Internet nursing, chronic disease follow-up

Technical tags: ESB, ETL + CDC, NLP, FaaS, SaaS, Hadoop, microservices





Introduction: As healthcare informatization accelerates, the variety and volume of medical data are growing at an unprecedented rate. In the era of big data, learning to analyze data and apply it at work not only saves time and improves efficiency, but also unlocks the value of the data and makes things more convenient for doctors and patients.

1、 Background of medical big data

Urgent needs:

  • Lack of a unified top-level design: systems are built by stacking modules around the HIS as the core, so the hospital's overall information architecture scales poorly and struggles to adapt to later information development;

  • Insufficient depth of clinical application: the hospital needs to become clinically centered, provide excellent medical services for patients, maintain a comprehensive and accurate source of clinical information, and improve medical quality and service levels;

  • Lack of refined operational support: resources are not managed centrally and uniformly, and hospital personnel and property are not managed in an integrated way;

  • Lack of hospital-wide data integration: data utilization is low, and the data does not support clinical work or management;

  • Excessive coupling between information systems: the design is integrated around the HIS, so coupling is high, and centralized data storage creates problems for system upgrades and data security;

  • Lack of data standardization: there is no hospital-wide data standard, and the rate of data sharing between systems is low.

2、 Sources of medical big data

1. Data generated during the patient’s medical treatment

Starting at registration, the patient's name, age, address, telephone number and other personal information are entered into the system. During the face-to-face consultation, the patient's physical condition, medical images and other information are also recorded. After treatment ends, expense, reimbursement and medical insurance usage information is added to the hospital system.

This is the most basic and the largest source of raw data in medical big data.

2. Clinical medical research and laboratory data

The integration of clinical and laboratory data makes the data held by medical institutions grow very quickly. A common CT image contains about 150 MB of data, and a standard pathology image is close to 5 GB.

If we multiply this per-patient data volume by the size of the population and the average life span, the cumulative data of a single community hospital can reach the terabyte (TB) or even petabyte (PB) scale.
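The multiplication described above can be sketched as a rough estimate. Only the two image sizes come from the text; the population, the time span and the number of images per person per year are hypothetical placeholders chosen for illustration, so the result shows an order of magnitude rather than a measured figure.

```python
# Back-of-the-envelope storage estimate for one community hospital.
# Only the two image sizes are taken from the article; every other
# input below is a hypothetical placeholder.
MB = 10**6
GB = 10**9
TB = 10**12
PB = 10**15

CT_IMAGE_BYTES = 150 * MB        # from the text: a common CT image
PATHOLOGY_IMAGE_BYTES = 5 * GB   # from the text: a standard pathology image


def lifetime_bytes(population, years, ct_per_person_year, pathology_per_person_year):
    """Total bytes = per-person yearly volume x population x years."""
    per_person_year = (ct_per_person_year * CT_IMAGE_BYTES
                       + pathology_per_person_year * PATHOLOGY_IMAGE_BYTES)
    return per_person_year * population * years


# Hypothetical inputs: 50,000 residents served, a 75-year life span,
# one CT image every two years and one pathology image every ten years.
total = lifetime_bytes(50_000, 75, 0.5, 0.1)
print(f"{total / TB:.0f} TB = {total / PB:.2f} PB")
```

With these assumed inputs the total lands in the low-petabyte range, which is consistent with the TB-to-PB scale mentioned in the text.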

3. Pharmaceutical enterprises and Life Sciences

Drug research and development is highly data-intensive; even small and medium-sized enterprises hold data at the terabyte (TB) scale or above.

In the life sciences, as computing power and gene sequencing capacity steadily increased, Jason Bauby, director of the personal genome project at Harvard Medical School, predicted that by 2015, 50 million people would have a personal genome map. A single genome sequence file is about 750 MB.

4. Health management brought by intelligent wearable devices

With the rapid development of mobile devices and the mobile Internet, portable wearable medical devices are becoming popular. Individual health information will be uploaded directly to the Internet, so personal health data can be collected anytime and anywhere, and the resulting volume of data will be enormous.

3、 The value of medical big data

1. Serving residents

A health guidance service system for residents provides precise medical treatment and personalized health care guidance, so that residents receive continuous service across hospitals, communities and online channels. Miaohealth provides a team of professional doctors who can help users with a range of conditions online and offer health guidance.

Residents can also buy everyday medicines from the mobile pharmacy, which is convenient and fast.

2. Serving doctors

Clinical decision support covers medication analysis, adverse drug reactions, disease complications, correlation analysis of treatment effects and analysis of antibiotic use, as well as the design of personalized treatment plans. Through the Miaohealth online doctor service, doctors can give medical guidance and make diagnoses online, which effectively reduces the number of outpatient visits.

3. Serving scientific research

This includes disease diagnosis and prediction, statistical tools and algorithms that improve clinical trial design, and the analysis and processing of clinical trial data, for example identifying susceptibility genes for major diseases, identifying people with extreme phenotypes, and establishing personal health and medical records.

Personal health and medical records allow a patient's medical information to be shared, so doctors can quickly review the patient's medical history directly, repeated consultations are avoided, and patients receive timely and effective treatment.

4. Serving administrative bodies

Evaluation of standardized drug use; evaluation of prevention and intervention measures for epidemics and public health emergencies; public health monitoring; optimization of clinical pathways, and so on.

5. Public health services

This includes monitoring and early warning of health risk factors, online platforms, community services and so on. Data collection, risk assessment and health intervention are combined to provide customers with health management and a series of related services.

4、 The status quo of medical big data

1. Heterogeneous data

There are many platforms and many interfaces, and data types follow no common standard. Large volumes of data can only be exchanged through point-to-point integration, which makes the content cumbersome, the process complex and the speed slow.

2. Scattered topics

Medical information is spread across different platforms, so it cannot be integrated into a fully electronic, patient-centered record, and complete, comprehensive, accurate and timely clinical information about a patient cannot be provided.

3. Large amount of data

In big data settings, industry applications typically handle hundreds of millions of records, and storage is typically at the TB or PB level or beyond.

4. Data polymorphism

The data model can only be determined after the data has appeared, and it keeps evolving as the data grows.

5、 Establish medical big data asset catalog

According to the “Technical solution for the construction of hospital information platform based on electronic medical records – business part”, issued by the Statistical Information Center of the Ministry of Health in March 2011, data assets are organized into the following three domains:

1. Clinical service domain

It includes 12 secondary categories: patient identification, patient service, admission, discharge and transfer, medical orders, medical records, nursing documents, laboratory tests, examinations, surgical anesthesia, treatment, blood transfusion and health examination, with a total of 26 business subdomains.

2. Hospital management domain

It includes four secondary categories: medical management, human resource management, financial management, and material and logistics service management, with a total of 26 business subdomains.

3. Platform application domain

It includes five secondary categories: regional medical collaboration, management decision-making, clinical decision-making, public health information reporting and patient public service, with a total of 20 subdomains.

Based on the clinical service, hospital management and platform application domains, a data asset catalog is built with business activities at its core. Data element identifiers are organized by business activity topic so that every identifier is unique (basic data sets: urban and rural residents’ health records, disease management, medical services, electronic medical records, etc.).



1) The data element identifier “DE08.10.052.00” is taken from T/CHIA 7.3-2018, Electronic medical record data set for hypertension specialty, Part 3: Outpatient (emergency) prescription for hypertension.


2) The allowable values “WS 218-2002” are taken from WS 218-2002, Classification and code of health institutions (organizations).
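A catalog that is organized by domain and business topic and that enforces unique data element identifiers could look roughly like the sketch below. This is not the platform's actual data model: the class names, field names and the element's display name are hypothetical. Only the identifier and the two standards are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    identifier: str        # e.g. "DE08.10.052.00" (from the text)
    name: str              # display name; hypothetical in this sketch
    domain: str            # clinical service / hospital management / platform application
    topic: str             # business activity topic
    source_standard: str   # standard that defines the identifier
    value_standard: str    # standard that defines the allowable values


class AssetCatalog:
    """In-memory catalog that rejects duplicate data element identifiers."""

    def __init__(self):
        self._by_id = {}

    def register(self, element):
        if element.identifier in self._by_id:
            raise ValueError(f"duplicate data element identifier: {element.identifier}")
        self._by_id[element.identifier] = element

    def by_topic(self, domain, topic):
        """Return the elements of one business topic, sorted by identifier."""
        matches = (e for e in self._by_id.values()
                   if e.domain == domain and e.topic == topic)
        return sorted(matches, key=lambda e: e.identifier)


catalog = AssetCatalog()
catalog.register(DataElement(
    identifier="DE08.10.052.00",
    name="example element",                  # hypothetical placeholder
    domain="clinical service",
    topic="outpatient (emergency) prescription",
    source_standard="T/CHIA 7.3-2018",
    value_standard="WS 218-2002",
))
found = catalog.by_topic("clinical service", "outpatient (emergency) prescription")
print([e.identifier for e in found])
```

Registering the same identifier a second time raises `ValueError`, which is one simple way to keep identifiers unique across topics.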


6、 Extracting data from business systems

Map the table fields of the business systems to data elements and create scheduling tasks, as shown in Figure 5.


When several tables are mapped, select the primary key and foreign key fields that join the associated tables, as shown in Figure 6.
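To illustrate the multi-table case, the sketch below builds an extraction query from a field mapping plus a primary key and foreign key pair, using Python's built-in `sqlite3` module. The table names, column names and the `build_extract_sql` helper are hypothetical stand-ins for whatever the business systems and the platform actually use.

```python
import sqlite3

# Hypothetical source tables standing in for a business system.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE prescription (
    rx_id INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES patient(patient_id),
    drug_name TEXT
);
INSERT INTO patient VALUES (1, 'Patient A'), (2, 'Patient B');
INSERT INTO prescription VALUES (10, 1, 'Drug X'), (11, 1, 'Drug Y'), (12, 2, 'Drug Z');
""")

# Field mapping: target field name -> source "table.column".
FIELD_MAP = {
    "patient_name": "patient.name",
    "drug_name": "prescription.drug_name",
}


def build_extract_sql(field_map, primary, foreign):
    """Build a JOIN query from a field mapping and a key pair.

    primary and foreign are (table, column) tuples. Identifiers are
    interpolated directly, so they must come from trusted configuration.
    """
    cols = ", ".join(f"{src} AS {dst}" for dst, src in field_map.items())
    (p_table, p_col), (f_table, f_col) = primary, foreign
    return (f"SELECT {cols} FROM {f_table} "
            f"JOIN {p_table} ON {f_table}.{f_col} = {p_table}.{p_col} "
            f"ORDER BY {f_table}.rowid")


sql = build_extract_sql(FIELD_MAP,
                        primary=("patient", "patient_id"),
                        foreign=("prescription", "patient_id"))
rows = conn.execute(sql).fetchall()
print(rows)
```

Each prescription row is returned together with the name of the patient it belongs to, which is the effect of selecting the key fields when mapping associated tables.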




A scheduling task supports two scopes, a single table or all tables, and two extraction modes: extraction of historical data and real-time monitoring extraction of new changes.
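The difference between historical extraction and monitoring extraction can be sketched as follows. This is a simplified illustration: production CDC tools usually read the database's change log, whereas this sketch polls with a high-water mark on an ever-increasing `id` column. The table, the columns and both function names are assumptions made for the example.

```python
import sqlite3


def extract_full(conn, table):
    """Historical extraction: read every row of the table once."""
    return conn.execute(f"SELECT * FROM {table} ORDER BY id").fetchall()


def extract_incremental(conn, table, last_seen_id):
    """Monitoring extraction: read only rows added since the last run.

    Assumes the first column is an ever-increasing integer id.
    Returns the new rows and the updated high-water mark.
    """
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE id > ? ORDER BY id", (last_seen_id,)
    ).fetchall()
    new_watermark = rows[-1][0] if rows else last_seen_id
    return rows, new_watermark


# Hypothetical source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lab_result (id INTEGER PRIMARY KEY, patient_id INTEGER, value REAL)")
conn.executemany("INSERT INTO lab_result VALUES (?, ?, ?)",
                 [(1, 100, 5.4), (2, 101, 6.1)])

# First run: load the history and remember the highest id seen.
history = extract_full(conn, "lab_result")
watermark = history[-1][0] if history else 0

# A new row arrives; the next scheduled run picks up only that row.
conn.execute("INSERT INTO lab_result VALUES (3, 100, 5.9)")
changes, watermark = extract_incremental(conn, "lab_result", watermark)
print(len(history), changes, watermark)
```

A polling watermark like this misses updates and deletes to existing rows, which is one reason log-based change capture is generally preferred when the source database supports it.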




7、 Data quality control

A complete indicator system for evaluating data quality should cover at least completeness (events, forms, records, fields), consistency (master data consistency, logical consistency), uniqueness (no ambiguous or redundant data, a single definition for each indicator and its calculation method), timeliness, originality, traceability and measurability.
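Three of these dimensions (completeness, uniqueness and timeliness) lend themselves to simple automated checks. The sketch below is a minimal illustration on in-memory records; the field names, the sample data and the one-hour delay threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def completeness(records, required_fields):
    """Share of records in which every required field is non-empty."""
    if not records:
        return 1.0
    ok = sum(all(r.get(f) not in (None, "") for f in required_fields)
             for r in records)
    return ok / len(records)


def duplicate_keys(records, key):
    """Key values that occur more than once (uniqueness violations)."""
    seen, dups = set(), set()
    for r in records:
        k = r.get(key)
        if k in seen:
            dups.add(k)
        seen.add(k)
    return dups


def late_records(records, event_field, load_field, max_delay):
    """Records loaded later than max_delay after the event (timeliness)."""
    return [r for r in records if r[load_field] - r[event_field] > max_delay]


# Hypothetical visit records.
records = [
    {"visit_id": "V1", "patient_id": "P1",
     "event_time": datetime(2024, 1, 1, 9, 0), "load_time": datetime(2024, 1, 1, 9, 5)},
    {"visit_id": "V2", "patient_id": "",
     "event_time": datetime(2024, 1, 1, 10, 0), "load_time": datetime(2024, 1, 2, 12, 0)},
    {"visit_id": "V1", "patient_id": "P3",
     "event_time": datetime(2024, 1, 1, 11, 0), "load_time": datetime(2024, 1, 1, 11, 30)},
]

score = completeness(records, ["visit_id", "patient_id"])
dups = duplicate_keys(records, "visit_id")
late = late_records(records, "event_time", "load_time", timedelta(hours=1))
print(round(score, 2), dups, [r["visit_id"] for r in late])
```

Consistency, traceability and the other dimensions usually need reference data or lineage metadata, so they are left out of this sketch.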

8、 Data center construction

Use the data asset catalog to locate business topics quickly, as shown in Figure 8.



Depending on the business scenario, users tick the data element names they need, and the platform automatically generates an API or a new topic library (data mart).
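One way to picture this step is a function that turns the selected data element names into a database view (the topic library) and returns rows in a shape an API could serialize. The sketch below uses `sqlite3`. The catalog contents, the `visit_fact` table and both function names are hypothetical, and the platform's real generation mechanism is not described in the text.

```python
import sqlite3

# Hypothetical catalog: data element name -> column in a data center table.
CATALOG = {
    "patient name": "patient_name",
    "diagnosis": "diagnosis",
    "visit date": "visit_date",
}


def create_topic_view(conn, view_name, selected_names, source_table="visit_fact"):
    """Create a view that exposes only the selected data elements."""
    unknown = [n for n in selected_names if n not in CATALOG]
    if unknown:
        raise KeyError(f"not in catalog: {unknown}")
    cols = ", ".join(CATALOG[n] for n in selected_names)
    # view_name and source_table must come from trusted configuration.
    conn.execute(f"CREATE VIEW {view_name} AS SELECT {cols} FROM {source_table}")


def query_topic(conn, view_name):
    """Return rows as dicts, the shape an API endpoint might serialize."""
    cur = conn.execute(f"SELECT * FROM {view_name}")
    names = [d[0] for d in cur.description]
    return [dict(zip(names, row)) for row in cur.fetchall()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit_fact (patient_name TEXT, diagnosis TEXT, visit_date TEXT)")
conn.execute("INSERT INTO visit_fact VALUES ('Patient A', 'hypertension', '2024-01-01')")

create_topic_view(conn, "hypertension_mart", ["patient name", "diagnosis"])
result = query_topic(conn, "hypertension_mart")
print(result)
```

A view keeps the topic library in sync with the underlying table. A materialized copy would be the alternative when the data mart has to be queried independently of the source.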