Message system architecture in modern IM systems: model part

Time:2020-1-21

Preface

In the architecture part, we introduced the architecture of modern IM message systems, the abstract Timeline model, and a typical message system architecture built on the Timeline model that supports advanced features such as “message roaming”, “multi-terminal synchronization” and “message retrieval”. To make the Tablestore Timeline model easier to understand, that article briefly introduced the basic logical model of Timeline and gave an accessible overview of basic message system concepts: the synchronization modes, message storage and message indexing.

This article supplements the architecture part. It interprets the Tablestore Timeline model in detail, so that readers can understand the basic functions and core components of Timeline at the implementation level. Finally, using an IM message system as the scenario, we look at how message synchronization, storage and indexing can be implemented on top of the Tablestore Timeline.

Timeline model

The Timeline model takes “simplicity” as its design goal, and its core modules are clearly separated. They are:

  • Store: the timeline repository, similar to a table in a database.
  • Identifier: the unique identifier that distinguishes one timeline from another.
  • Meta: the metadata that describes a timeline. Metadata is schema-free and can contain arbitrary columns.
  • Queue: all messages in a timeline are stored in its queue.
  • Message: the message body carried in a timeline. It is also schema-free and can contain arbitrary columns.
  • Index: includes the meta index and the message index. An index can be defined on any column in meta or message, and provides flexible multi-condition combined queries and search.
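
The relationship between these modules can be sketched in a few lines of Python. This is only an in-memory illustration of the concepts above, not the Tablestore Timeline API: the class names `Identifier`, `Message`, `Timeline` and `TimelineStore`, and the method `find_by_meta`, are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class Identifier:
    """Unique key of one timeline; may consist of several fields."""
    fields: Tuple[Tuple[str, Any], ...]


@dataclass
class Message:
    """Schema-free message body: any columns are allowed."""
    columns: Dict[str, Any]


@dataclass
class Timeline:
    identifier: Identifier
    # Schema-free metadata describing the timeline.
    meta: Dict[str, Any] = field(default_factory=dict)
    # The queue holds (sequence id, message) pairs in write order.
    queue: List[Tuple[int, Message]] = field(default_factory=list)


class TimelineStore:
    """Repository of timelines, comparable to a database table."""

    def __init__(self) -> None:
        self._timelines: Dict[Identifier, Timeline] = {}

    def get_or_create(self, identifier: Identifier, **meta: Any) -> Timeline:
        timeline = self._timelines.get(identifier)
        if timeline is None:
            timeline = Timeline(identifier, dict(meta))
            self._timelines[identifier] = timeline
        return timeline

    def find_by_meta(self, **conditions: Any) -> List[Timeline]:
        """Naive stand-in for the meta index: exact match on meta columns."""
        return [
            t for t in self._timelines.values()
            if all(t.meta.get(k) == v for k, v in conditions.items())
        ]
```

A real meta index would not scan every timeline as `find_by_meta` does here; the linear filter only shows what kind of question the index answers.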

Timeline Store

Timeline store is the repository of timelines and corresponds to the concept of a table in a database. The figure above shows the structure of the timeline store. All timeline data is stored in the store. Timeline is a data model for massive message volumes. It serves as both the message repository and the synchronization library, and must meet a variety of requirements:

  • Massive data storage: if messages in the repository must be kept permanently, the data volume keeps growing over time. The repository must cope with message data accumulated over a long period and reach PB-level capacity.
  • Low storage cost: message data has a clear hot/cold split, and most queries hit hot data. Cold data therefore needs a relatively low-cost storage medium, otherwise the storage cost becomes very high as data accumulates.
  • Data lifecycle management: whether message data is stored or synchronized, it needs a defined life cycle. The repository stores the message data itself online and usually has a long retention period. The synchronization library serves online or offline push in write diffusion mode and usually has a shorter retention period.
  • Very high write throughput: except for feed stream systems such as microblog and headline feeds, most instant messaging and friend circle scenarios use the write diffusion mode of message synchronization. This requires the underlying storage to offer very high write throughput to cope with message floods.
  • Low-latency reads: message systems usually serve online scenarios, so queries must have low latency.

The Tablestore Timeline is built on a distributed database that uses an LSM storage engine. The biggest advantage of LSM is that it is very write friendly, which makes it a natural fit for message write diffusion. Queries are also heavily optimized, for example by caching hot data and using Bloom filters. Data tables use range partitioning, which allows horizontal scaling, and a load balancing strategy automatically detects and handles hotspot partitions. To meet the different storage requirements of the synchronization library and the repository, some flexible custom configurations are also provided, mainly including:

  • Time to live: the data life cycle can be customized, for example permanent retention or retention for n days.
  • Storage type: the storage type can be customized. HDD is the best choice for a repository, and SSD is the best choice for a synchronization library.
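
As a rough illustration, the two options could be captured in a small configuration object. The names `StoreConfig` and `StorageType`, and the one-week TTL for the synchronization library, are assumptions made for this sketch and are not taken from the Tablestore SDK.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StorageType(Enum):
    SSD = "ssd"
    HDD = "hdd"


@dataclass(frozen=True)
class StoreConfig:
    storage_type: StorageType
    ttl_seconds: Optional[int]  # None means the data is kept permanently

    def is_expired(self, written_at: float, now: float) -> bool:
        """True if a row written at `written_at` has outlived its TTL."""
        if self.ttl_seconds is None:
            return False
        return now - written_at > self.ttl_seconds


DAY = 24 * 60 * 60

# Synchronization library: hot data, short life cycle (assumed one week here).
SYNC_STORE = StoreConfig(StorageType.SSD, ttl_seconds=7 * DAY)
# Repository: large, mostly cold data, kept permanently in this example.
REPOSITORY = StoreConfig(StorageType.HDD, ttl_seconds=None)
```

In the real service, expiry is handled by the storage layer itself; `is_expired` only makes the rule explicit.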

Timeline Module

A timeline store can hold a large number of timelines. The detailed structure of a single timeline is shown in the figure above. A timeline mainly consists of three parts:

  • Timeline meta: the metadata part, used to describe the timeline, including:
    • Identifier: uniquely identifies the timeline and can contain multiple fields.
    • Meta: the metadata that describes the timeline. It can contain any number of fields of any type.
    • Meta Index: the metadata index, which can index any attribute column in the metadata. It supports multi-field conditional combined queries and retrieval.
  • Timeline queue: a queue used to store and synchronize messages. Each element in the queue consists of two parts:
    • Sequence Id: the position of the message in the queue. Sequence IDs keep increasing within a queue.
    • Message: the entity in the queue that carries the message and contains its full content.
  • Timeline data: the data part of the timeline is the messages, which mainly includes:
    • Message: the message entity, which can also contain any number of fields of any type.
    • Message Index: the message data index, which can index any column in the message entity and supports multi-field conditional combined queries and retrieval.
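
The queue behaviour described above (a sequence ID that only grows, and reads that start from a given position) can be sketched as follows. `TimelineQueue`, `store` and `scan` are hypothetical names for this in-memory example. In the real system the sequence ID is assigned by the storage layer.

```python
import bisect
from typing import Any, Dict, List, Tuple


class TimelineQueue:
    """In-memory queue whose sequence IDs increase with every write."""

    def __init__(self) -> None:
        self._seq_ids: List[int] = []
        self._messages: List[Dict[str, Any]] = []
        self._next_seq = 1

    def store(self, message: Dict[str, Any]) -> int:
        """Append a message and return the sequence ID assigned to it."""
        seq = self._next_seq
        self._next_seq += 1
        self._seq_ids.append(seq)
        self._messages.append(message)
        return seq

    def scan(self, after_seq: int = 0,
             limit: int = 100) -> List[Tuple[int, Dict[str, Any]]]:
        """Return up to `limit` messages with sequence ID > after_seq."""
        # _seq_ids is sorted, so binary search finds the start position.
        start = bisect.bisect_right(self._seq_ids, after_seq)
        end = start + limit
        return list(zip(self._seq_ids[start:end], self._messages[start:end]))

    def latest_sequence_id(self) -> int:
        return self._seq_ids[-1] if self._seq_ids else 0
```

Because the IDs are ordered, a reader only needs to remember the last sequence ID it has seen in order to resume reading.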

IM message system modeling

Take a simple IM system as an example to see how to model it with the Tablestore Timeline model. In the example in the figure above, there are three users: A, B and C. A and B have a one-to-one (single) chat, A and C have a one-to-one chat, and A, B and C form a group chat. Let us see how message synchronization, storage, and the read and write paths are modeled on the Tablestore Timeline in this scenario.

Message synchronization model

Message synchronization uses the write diffusion model, which takes full advantage of the Tablestore Timeline. Write diffusion shifts work from the read path to the write path, which balances resource usage across the whole system. In the write diffusion model, every individual that receives messages has an inbox, and every message that must be synchronized to that individual is delivered to its inbox. In the example above, users A, B and C each have their own inbox. All of a user's devices pull new messages from the same inbox.
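
A minimal sketch of write diffusion for the three-user example follows. The session IDs, the `SyncStore` class and the `send` function are hypothetical. The sketch also assumes that the sender's own inbox receives a copy, so that the sender's other devices can synchronize the message.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

# Hypothetical session table for the example: session id -> members.
SESSIONS: Dict[str, List[str]] = {
    "A-B": ["A", "B"],
    "A-C": ["A", "C"],
    "A-B-C": ["A", "B", "C"],
}


class SyncStore:
    """One inbox per user; each inbox is an ordered list of (seq, message)."""

    def __init__(self) -> None:
        self.inboxes: Dict[str, List[Tuple[int, Dict[str, Any]]]] = defaultdict(list)

    def deliver(self, user: str, message: Dict[str, Any]) -> int:
        inbox = self.inboxes[user]
        seq = len(inbox) + 1  # sequence IDs are per inbox
        inbox.append((seq, message))
        return seq


def send(sync_store: SyncStore, session_id: str, sender: str, text: str) -> None:
    """Write diffusion: one send becomes one write per session member."""
    message = {"session": session_id, "sender": sender, "text": text}
    for member in SESSIONS[session_id]:
        sync_store.deliver(member, message)
```

A message sent to the group costs three writes, but each user afterwards reads only one inbox, no matter how many sessions the user belongs to.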

Message synchronization library

Inboxes are stored in the synchronization library, and each inbox corresponds to one timeline. In the example in the figure, there are three timelines serving as inboxes. Each receiving device keeps the SequenceID of the latest message it has pulled, and every pull of new messages starts from that SequenceID. Queries to the synchronization library are frequent and usually target the latest messages, so hot data should be cached in memory as much as possible to provide high-concurrency, low-latency queries. For this reason, the synchronization library is generally configured with SSD storage. Once a message has been synchronized to all terminals, it has been consumed and could in theory be cleaned from the inbox. By design, however, we do not clean up actively. Instead, we define a shorter life cycle so that data expires automatically, generally after one or two weeks. If a terminal still needs to pull messages after they have expired, it must fall back to read diffusion mode and pull the messages from the repository.
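
The pull logic can be sketched as below. The `Inbox` class, the one-week `TTL_SECONDS` value and the `needs_fallback` flag are assumptions for illustration. In particular, detecting expired messages by counting is a simplification of whatever the real client does.

```python
from typing import Any, Dict, List, Tuple

TTL_SECONDS = 7 * 24 * 60 * 60  # assumed one-week inbox life cycle

# (sequence id, write time, message)
Entry = Tuple[int, float, Dict[str, Any]]


class Inbox:
    def __init__(self) -> None:
        self._entries: List[Entry] = []
        self._next_seq = 1

    def deliver(self, message: Dict[str, Any], now: float) -> int:
        seq = self._next_seq
        self._next_seq += 1
        self._entries.append((seq, now, message))
        return seq

    def expire(self, now: float) -> None:
        """Drop entries older than the TTL (done by the store in practice)."""
        self._entries = [e for e in self._entries if now - e[1] <= TTL_SECONDS]

    def pull(self, last_seq: int,
             now: float) -> Tuple[List[Tuple[int, Dict[str, Any]]], bool]:
        """Return (messages newer than last_seq, needs_fallback).

        needs_fallback is True when some messages after last_seq have
        already expired, so the caller must read them from the repository.
        """
        self.expire(now)
        fresh = [(seq, msg) for seq, _, msg in self._entries if seq > last_seq]
        expected = self._next_seq - 1 - last_seq
        return fresh, len(fresh) < expected
```

A device that has been offline for longer than the TTL gets `needs_fallback=True` and would then read the missing range from each session's timeline in the repository.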

Message repository

The message repository stores the messages of every session, and the Outbox of each session corresponds to one timeline. Messages in the Outbox can be pulled by session. For example, browsing the message history of a session is done by reading its Outbox. New messages usually reach each receiving terminal through online push or by querying the synchronization library, so the repository is queried relatively rarely. Because the repository stores messages for a long time, possibly permanently, its data volume is larger than that of the synchronization library. The repository therefore generally uses HDD storage, and its data life cycle is determined by how long messages need to be kept, which is usually a long time.
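
Browsing history by session could look like the following sketch, which pages backwards through a session's Outbox. `MessageRepository`, `store` and `history` are hypothetical names, and the in-memory list stands in for the timeline queue.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

Row = Tuple[int, Dict[str, Any]]  # (sequence id, message)


class MessageRepository:
    """One outbox (timeline) per session, keyed by session id."""

    def __init__(self) -> None:
        self._outboxes: Dict[str, List[Row]] = defaultdict(list)

    def store(self, session_id: str, message: Dict[str, Any]) -> int:
        outbox = self._outboxes[session_id]
        seq = len(outbox) + 1  # sequence IDs start at 1 within each session
        outbox.append((seq, message))
        return seq

    def history(self, session_id: str, before_seq: Optional[int] = None,
                limit: int = 20) -> List[Row]:
        """Return up to `limit` messages older than before_seq, newest first."""
        outbox = self._outboxes.get(session_id, [])
        if before_seq is None:
            end = len(outbox)
        else:
            # The message with sequence id s sits at list index s - 1.
            end = max(0, min(before_seq - 1, len(outbox)))
        start = max(0, end - limit)
        return list(reversed(outbox[start:end]))
```

A client scrolling upwards would pass the smallest sequence ID it has already displayed as `before_seq` to load the next page.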

Message index library

The message index library is attached to the repository. It uses the message index of Timeline to index the messages in the repository, for example a full-text index on the text content and indexes on recipient, sender and sending time, and it supports advanced queries and searches such as full-text retrieval.
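
To show what combining a full-text condition with a field condition means, here is a toy inverted index. It is not how the Tablestore index is implemented. `MessageIndex` and its methods are hypothetical, and the tokenizer is a simple lowercase word split.

```python
import re
from collections import defaultdict
from typing import Any, Dict, List, Optional, Set


class MessageIndex:
    """Toy inverted index over the `text` column, plus a sender filter."""

    def __init__(self) -> None:
        self._messages: Dict[int, Dict[str, Any]] = {}
        self._terms: Dict[str, Set[int]] = defaultdict(set)

    @staticmethod
    def _tokenize(text: str) -> List[str]:
        return re.findall(r"\w+", text.lower())

    def add(self, msg_id: int, message: Dict[str, Any]) -> None:
        self._messages[msg_id] = message
        for term in self._tokenize(message.get("text", "")):
            self._terms[term].add(msg_id)

    def search(self, query: str, sender: Optional[str] = None) -> List[int]:
        """Return ids of messages containing all query terms, sorted."""
        terms = self._tokenize(query)
        if not terms:
            return []
        # AND semantics: intersect the posting sets of all query terms.
        ids = set.intersection(*(self._terms.get(t, set()) for t in terms))
        if sender is not None:
            ids = {i for i in ids if self._messages[i].get("sender") == sender}
        return sorted(ids)
```

A production index would also cover recipients and sending time and would rank results; the sketch only demonstrates multi-condition matching.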

Summary

This article explains the Tablestore Timeline model and introduces the modules of Timeline, including store, meta, queue, data and index. It then uses a simple IM scenario to illustrate how to model a message system based on Timeline. In the next article, on implementation, we will build a simple IM system on the Tablestore Timeline that supports single chat, group chat, metadata management and message retrieval. Stay tuned.



Author: Muluo

This is original content of the Yunqi community and may not be reproduced without permission.