Application case: Building a hospital clinical knowledge base system with SequoiaDB + Spark


1. Background

Since the concept of the digital hospital was put forward in the 1990s, digital hospitals have spread and developed rapidly across major hospitals in China, with remarkable results. The hospital management information system (HIS), the image archiving and communication system (PACS), the electronic medical record system (EMR) and the regional medical and health service (GMIS) have all been successfully implemented and popularized. The rapid development of computer and network technology has also brought digital hospitals new interactive channels, such as telemedicine services and online appointment registration.

With the rapid development of IT, more than 80% of tertiary hospitals have established their own hospital information system (HIS), electronic medical record system (EMR), rational drug use system (PASS), laboratory management system (LIS), image archiving and communication system (PACS), mobile ward round and mobile nursing systems, and integrations with a large number of third-party interfaces. The medical field has entered the big data era. As HIS has come into wide use and its functions have kept improving, it has accumulated a large amount of medical data.

From 2012 onward, big data and its processing technologies were mentioned more and more in China, and the concept gained general acceptance. Big data technology now affects daily life: it is widely used in the Internet industry, and telecommunications, banking and other industries have also tried it widely to provide more stable, higher-quality services.

At present, medical IT systems collect this valuable data, but the historical medical data does not deliver its due value: it neither assists front-line clinicians with diagnosis nor provides the necessary support for hospital management and business decision-making.

In view of this, we propose to mine statistically grounded medical rules and knowledge from the hospital's existing medical records, prescriptions and diagnosis data, and to build a professional clinical knowledge base on top of those rules and knowledge. With strong association-based recommendation, the knowledge base provides diagnosis, prescription and drug recommendations to front-line medical staff, greatly improving the quality of medical services and reducing the workload of front-line medical personnel.

2. Introduction of main technical architecture

2.1 SequoiaDB

SequoiaDB is an enterprise-level distributed NewSQL database. It is independently developed with independent intellectual property rights, and its source code is not based on any other open source database. SequoiaDB supports standard SQL, transactions, high concurrency, distributed deployment, scalability and dual-engine storage. It is a commercial database product that has been open-sourced.

In addition to the JSON storage engine, the SequoiaDB core engine provides a distributed block storage mode to improve read and write performance for unstructured files. It splits a large unstructured file into fixed-size data blocks and distributes them across different partitions. This makes it possible to store massive volumes of unstructured files, and it suits scenarios such as image storage.
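The idea of fixed-size block splitting can be illustrated with a minimal sketch in plain Python. It does not use the SequoiaDB API. The function names are hypothetical, and the round-robin placement is an assumption made for illustration, since the placement rule the engine actually applies is not described here.

```python
from typing import Dict, List, Tuple


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def assign_to_partitions(
    blocks: List[bytes], num_partitions: int
) -> Dict[int, List[Tuple[int, bytes]]]:
    """Spread blocks round-robin over partitions (assumed placement rule).

    Each block keeps its sequence number so the file can be rebuilt later.
    """
    partitions: Dict[int, List[Tuple[int, bytes]]] = {
        p: [] for p in range(num_partitions)
    }
    for seq, block in enumerate(blocks):
        partitions[seq % num_partitions].append((seq, block))
    return partitions


def reassemble(partitions: Dict[int, List[Tuple[int, bytes]]]) -> bytes:
    """Collect blocks from all partitions and join them in sequence order."""
    pieces = sorted(
        (seq, block) for blocks in partitions.values() for seq, block in blocks
    )
    return b"".join(block for _, block in pieces)


image = bytes(range(256)) * 4  # 1024 bytes standing in for an image file
blocks = split_into_blocks(image, block_size=300)
partitions = assign_to_partitions(blocks, num_partitions=3)
restored = reassemble(partitions)
```

Because every block carries its sequence number, the partitions can be read in parallel and the file is still rebuilt in the right order.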


2.2 Spark

Spark is an open source, MapReduce-like computing framework from the AMPLab at UC Berkeley. It is a memory-based cluster computing system whose original goal was to eliminate the disk read and write overhead of MapReduce. At the time of writing, the latest version was 1.5.0. With its high performance and ease of use, Spark has attracted many big data researchers. Through the efforts of many enthusiasts it has gradually formed its own ecosystem (Spark SQL, MLlib, Spark Streaming and GraphX sit on top of the Spark core) and has become a top-level Apache project.

The core concept of Spark is the resilient distributed dataset (RDD), Spark's abstraction of distributed memory. Users operate on an RDD just as they would on a local data set, so they can concentrate on business logic. In Spark programs, all data operations are based on RDDs. The classic WordCount program runs under the Spark programming model as follows:


Spark abstracts rdd1 from the file system, rdd1 yields rdd2 through the flatMap operator, and rdd2 yields rdd3 through the reduceByKey operator. Finally, the data in rdd3 is written back to the file system. Every operation is based on RDDs.
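The WordCount dataflow can be mimicked in plain Python to make the two operators concrete. This is a sketch of the semantics only, not Spark code: `flat_map` and `reduce_by_key` are hypothetical local helpers that work on ordinary lists, and the `(word, 1)` pairs are emitted directly inside the flat-map step so that the flow matches the two operators named above.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def flat_map(records: Iterable[str],
             fn: Callable[[str], Iterable[Tuple[str, int]]]) -> List[Tuple[str, int]]:
    """Apply fn to every record and flatten the results into one list."""
    return [out for record in records for out in fn(record)]


def reduce_by_key(pairs: Iterable[Tuple[str, int]],
                  fn: Callable[[int, int], int]) -> Dict[str, int]:
    """Merge all values that share a key with the binary function fn."""
    merged: Dict[str, int] = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


# Stands in for rdd1: lines read from the file system.
lines = ["spark builds rdd", "rdd feeds spark"]

# Stands in for rdd2: one (word, 1) pair per word.
pairs = flat_map(lines, lambda line: [(word, 1) for word in line.split()])

# Stands in for rdd3: counts summed per word.
counts = reduce_by_key(pairs, lambda a, b: a + b)
```

In Spark the same steps run on partitions spread across the cluster, while the program still reads like operations on a local collection.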

3. Ideas and structure

After much consideration, we decided to build the hospital clinical knowledge base system on Spark. SequoiaDB serves as the underlying data storage platform, the storage center for big data, and Spark serves as the big data analysis platform. The ETL data extraction and transformation tool is built on AgileEAS.NET SOA middleware (Pentaho Kettle is used in the later stage). The service portal of the knowledge base is also built on AgileEAS.NET SOA middleware and integrates with the HIS system through WCF / WebService. AgileEAS.NET SOA + FineUI provides basic dictionary management and visualization of the analysis results.

We chose SequoiaDB as the big data storage center, and for this purpose I wrote a C# driver for SequoiaDB. At first we chose Spark 1.3.1 and developed the hospital clinical knowledge base system with Scala 2.10. In the later stage of the project we upgraded the computing framework from Spark 1.3.1 to Spark 1.6.2 (Spark has recently released version 2.0, which greatly improves performance and stability).

Because Spark is deployed on Linux, the Spark analysis results are also stored in SequoiaDB. The Spark data analysis code is written in IntelliJ IDEA 14.1.4, and the rest of the code in VS2010.

3.1 Overall structure

The whole system consists of a data acquisition layer, a storage analysis layer and an application logic layer, plus the external data sources the system draws on. At present the external data source is mainly clinical data generated by the hospital information system, concentrated in HIS. EMR, LIS and PACS will be added in a later stage.


Data acquisition layer: It is mainly responsible for collecting massive historical clinical data from the clinical business systems. Collection is divided into batch collection and real-time collection. During collection, the original data is validated, cleaned and converted, and the processed data is stored in the big data warehouse.

Storage analysis layer: It is mainly responsible for data storage and data analysis. The valid data left after cleaning and conversion is stored in the big data cluster in JSON format, using SequoiaDB as the big data store. Data analysis is done by the Spark cluster: Spark loads the data from big data storage, analyzes it, and writes the results into the clinical knowledge database, which is also stored in SequoiaDB.

Application logic layer: It is mainly responsible for human-computer interaction and for feeding analysis results back to the clinical systems. It presents knowledge in tables and charts to clinicians and business managers through the web UI, and provides an integration API for the assistance and recommendation functions of the clinical systems. The API is currently provided through WebService and WebAPI.
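The validate, clean and convert step of the data acquisition layer can be sketched as follows. The field names, the required-field rule and the date format are all hypothetical, chosen only to illustrate the idea. The real rules depend on each clinical business system, and the sketch stops at producing JSON text rather than writing to SequoiaDB.

```python
import json
from datetime import datetime
from typing import Optional

# Hypothetical schema: fields a row must carry to be accepted.
REQUIRED_FIELDS = ("patient_id", "diagnosis", "visit_date")


def clean_record(raw: dict) -> Optional[dict]:
    """Return a normalized record, or None if the row fails validation."""
    # Cleaning: trim stray whitespace from every text field.
    record = {k: v.strip() if isinstance(v, str) else v for k, v in raw.items()}

    # Validation: reject rows with missing or empty required fields.
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return None

    # Conversion: accept only ISO-style dates and store them uniformly.
    try:
        visit = datetime.strptime(record["visit_date"], "%Y-%m-%d")
    except (ValueError, TypeError):
        return None
    record["visit_date"] = visit.date().isoformat()
    record["diagnosis"] = record["diagnosis"].lower()
    return record


raw_rows = [
    {"patient_id": " P001 ", "diagnosis": "Hypertension ", "visit_date": "2015-03-02"},
    {"patient_id": "P002", "diagnosis": "", "visit_date": "2015-03-02"},
    {"patient_id": "P003", "diagnosis": "Diabetes", "visit_date": "02/03/2015"},
]

# Keep only valid rows and serialize them as JSON documents for storage.
documents = [
    json.dumps(record, sort_keys=True)
    for record in map(clean_record, raw_rows)
    if record is not None
]
```

Rejecting bad rows at this stage keeps the downstream analysis free of per-record checks.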

3.2 System data flow

The system collects data from the data sources and writes it into the SequoiaDB big data storage cluster, where Spark analyzes and computes over it. The clinical knowledge produced by the analysis is written into the SequoiaDB knowledge base and delivered for clinical use through the web UI and the standard API.
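To make the analysis step of this flow concrete, here is a minimal sketch of one statistic such a system could compute: for each diagnosis, the share of visits in which a given drug was prescribed. The article does not specify which algorithm the system runs, so the confidence measure, the threshold, the function name and the sample data are all assumptions. The sketch also runs in plain Python on a small list rather than on Spark.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def drug_recommendations(
    visits: Iterable[Tuple[str, List[str]]], min_confidence: float = 0.5
) -> Dict[str, List[Tuple[str, float]]]:
    """Map each diagnosis to drugs prescribed in at least min_confidence of its visits."""
    diagnosis_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for diagnosis, drugs in visits:
        diagnosis_counts[diagnosis] += 1
        for drug in set(drugs):  # count each drug once per visit
            pair_counts[(diagnosis, drug)] += 1

    knowledge: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for (diagnosis, drug), n in pair_counts.items():
        confidence = n / diagnosis_counts[diagnosis]
        if confidence >= min_confidence:
            knowledge[diagnosis].append((drug, confidence))

    # Strongest associations first, ties broken by drug name.
    for diagnosis in knowledge:
        knowledge[diagnosis].sort(key=lambda item: (-item[1], item[0]))
    return dict(knowledge)


# Made-up sample visits: (diagnosis, prescribed drugs).
visits = [
    ("hypertension", ["amlodipine", "aspirin"]),
    ("hypertension", ["amlodipine"]),
    ("hypertension", ["losartan"]),
    ("diabetes", ["metformin"]),
]
knowledge = drug_recommendations(visits)
```

The resulting dictionary plays the role of the knowledge base entries: given a diagnosis, the application layer would look up the ranked drug list and return it through the API.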


3.3 Data import process


The initial collection and import of historical data uses the scheduled tasks of AgileEAS.NET SOA. The import is implemented as C# scripts, which the scheduled tasks coordinate and run at regular intervals. The import code can be adjusted for different clinical business systems. Pentaho Kettle can also be used to import data.

3.4 System physical architecture

The clinical data sources feed the system's analysis and come from the clinical HIS and EMR. At present the hospital's HIS uses a SQL Server 2008 R2 database and EMR uses an Oracle 11g database, running on the Windows 2008 operating system.

The SequoiaDB cluster is the big data storage database cluster. It currently runs SequoiaDB v2.0 on CentOS 6.5 and uses 2 to 16 nodes depending on the scale of the business. It stores the massive historical clinical data after cleaning and conversion for analysis by the Spark cluster, and serves the SOA server for historical data queries and history-based recommendations.

The Spark cluster is the core of the system's analysis and computation. It analyzes the historical data in the SequoiaDB cluster and generates medical knowledge to assist clinicians. Depending on the scale of the business, the cluster uses 2 to 16 nodes, running CentOS 6.5, the Java 1.7.79 runtime, Scala 2.11.4 and the Spark 1.3.1 analysis framework.

SequoiaDB also serves as the system's knowledge storage database. The analysis results produced by the Spark cluster are written into this database and processed by the SOA server and web service for clinical system integration and web GUI display.

The SOA server is the system's external interface application server. It provides business logic to the clinical business systems and the web server, and a service API to the clinical business systems. It currently runs on Windows 2008 with .NET Framework 4.0 and hosts the SOA Service of the AgileEAS.NET SOA middleware, which exposes standard WebService and WebAPI interfaces to external systems.

The web server provides a standard B/S (browser/server) user interface, through which users manage the system and query and use the medical knowledge in the knowledge base. It currently runs on Windows 2008 with .NET Framework 4.0 under IIS 7.0.

The clinical workstations run the HIS and EMR systems, both developed in C# on an SOA architecture. After integration, they connect to the system through standard WebService and use its API to provide clinical diagnosis and treatment assistance.


Both NoSQL and Spark, as emerging big data architectures, will be core foundations of big data applications. SequoiaDB's distributed architecture supports the storage of massive data, and its JSON / LOB architecture handles unstructured data. These two characteristics are core data requirements of the medical industry. As a data source, SequoiaDB also connects well with Spark (SequoiaDB is one of Spark's more than 10 officially certified distributors worldwide). It can be said that SequoiaDB greatly improves the performance and stability of the entire data system.

This article is a real-world application case from a SequoiaDB community user.