Cassandra data modeling


one   summary

Apache Cassandra stores data in tables, each consisting of rows and columns. CQL (Cassandra query language) is used to query the data stored in the table. Apache Cassandra data model is based on query and optimized for query. Cassandra does not support relational data modeling for relational databases. Cassandra data modeling focuses on queries.

Data modeling in Cassandra uses a query driven approach, in which specific queries are the key to organizing data. Query is the result of selecting data from a table, and schema is the definition of the arrangement of data in the table. Cassandra’s database design is based on the requirement of fast reading and writing, so the better the architecture design, the faster the data writing and retrieval speed.

On the contrary, the relational database normalizes the data according to the designed tables and relationships, and then writes the query to be carried out. Data modeling in relational database is table driven, and any relationship between tables is represented as table join in query.

1.1.   What is data modeling

Data modeling is the process of identifying entities and their relationships. In a relational database, data is placed in a normalized table with a foreign key, which is used to refer to related data in other tables. The query that the application will make is driven by the structure of the table, and the related data is queried as a table join.

In Cassandra, data modeling is query driven. Data access patterns and application queries determine the structure and organization of data, and then use it to design database tables.

Data is modeled around specific queries. The best design of a query is to access a single table, which means that all entities involved in the query must be in the same table to make data access (read) very fast. Data is modeled to best fit a query or a set of queries. A table may have one or more entities that are most suitable for querying. Since there are usually relationships between entities, and queries may involve entities that have relationships between entities, a single entity can be included in multiple tables.

1.2.   Query driven modeling

In the relational database model, queries use table joins to get data from multiple tables, but in Cassandra, joins are not supported, so all the required fields (columns) must be combined in one table. Since each query is supported by a table, in the process called denormalization, data will be redundant among multiple tables. Data redundancy and high write throughput are used to achieve high read performance.

1.3.   target

The choice of primary key and partition key is very important for the uniform distribution of data in the whole cluster. It is also important to keep the number of partitions read by queries to a minimum, because different partitions may be located on different nodes, and the coordinator will need to send requests to each node, thus increasing the overhead and latency of requests. Even if the different partitions involved in the query are located on the same node, fewer partitions can improve the query efficiency.

1.4.   partition

The partition key is generated from the first field of the primary key. Using the partition key to partition the data to the hash table can provide faster queries. The fewer partitions used for the query, the faster the response time of the query.

The following is an example of a partition, assuming that table t has a host ID

   id int,
   k int,
   v text,

Partition key is generated from primary key ID, which is used to allocate data among nodes in the cluster.

The following example is a composite primary key

   id int,
   c text,
   k int,
   v text,
   PRIMARY KEY (id,c)

For table t with composite primary key, the first field ID is used to generate partition key, and the second field C is used to sort cluster key within partition. Using clustering key to sort data can improve the efficiency of retrieving adjacent data.

Generally, the first field of the primary key is hashed to generate the partition key, while the remaining fields are the clustering keywords used to sort the data in the partition. Partitioning data can improve the efficiency of reading and writing. Other fields that are not primary key fields can be indexed separately to further improve query performance.

In the next example, Id1 and Id2 users generate partition keys, and C1 and C2 are used to sort the cluster keywords within the partition.

   id1 int,
   id2 int,
   c1 text,
   c2 text
   k int,
   v text,
   PRIMARY KEY ((id1,id2),c1,c2)

1.5.   Comparison with relational data model

Relational databases use foreign keys to store data in tables that are related to other tables. The data modeling method of relational database is table centered. Queries must use table joins to get data from multiple tables, which have relationships. Apache Cassandra has no concept of foreign key or relational integrity. Cassandra’s data model is based on efficient query design, which does not involve multiple tables. Relational database normalizes data to avoid duplication. Instead, Cassandra denormalizes data by redundant data in multiple tables of a query centric data model. If the Cassandra data model cannot fully integrate the complexity of relationships between different entities for a particular query, client side joins in the application code can be used.

1.6.   Data modeling example

Suppose there is a set of magazine data, and the attributes include magazine ID, magazine name, publishing frequency, publishing date and publisher.

Query 1: list all magazine names, including their publishing frequency.

Since there is no need to query all attributes, the data model will only consist of ID (used for partition key), magazine name and publishing frequency, as shown in the following figure:


Query 2: list all magazine names by publisher

The publisher is added to the output column, and the publisher is used as the partition key, as shown in the figure below

1.7.   Define schema

Query 1 is defined as follows:

CREATE TABLE magazine_name (
    id int PRIMARY KEY,
    name text,
    publication_requency text

For query 2, the definition is as follows:

CREATE TABLE magazine_publisher (
    publisher text,
    id int,
    name text, 
    publication_requency text,
    PRIMARY KEY (publisher, id)

two   Conceptual data modeling

First, create a simple domain model that is easy to understand in the relational world, and then see how to map it from relational to distributed hash table model in Cassandra.

Taking hotel reservation as an example, the conceptual areas include hotels, the guests staying in hotels, the collection of rooms in each hotel, the price and vacancy of these rooms, and the reservation records for guests. Hotels usually also maintain a collection of “attractions,” including parks, museums, shopping galleries, monuments or other places near the hotel that guests may visit during their stay. Hotels and points of interest need to maintain geographic location data so that they can be found on the map to mix and match, and calculate the distance. The ER diagram is as follows:


It is clear at a glance that a hotel has multiple rooms, and there are multiple leisure facilities in a room. The leisure situation of the room is also divided into time periods. There are multiple scenic spots near the hotel. A customer can book multiple rooms, and each reservation record corresponds to multiple rooms.

three   Design of relational database

When we build a new data-driven application, we will use a relational database. Firstly, domain objects are transformed into a set of normalized tables, and foreign keys are used to refer to related data in other tables.


3.1.   Design differences between RDBMS and Cassandra

There is no join

In Cassandra, the join operation cannot be performed. If you have designed a data model and need a data connection to other tables, you have to make this connection on the client, or create a non standardized second table to represent the connection results. The latter method is the preferred method for Cassandra data modeling.

• There are no foreign key references

In a relational database, you can specify a foreign key in a table to refer to the primary key of another table. But Cassandra does not require that foreign keys must be defined to reference. It is still a common design requirement to store IDS related to other entities in the table, but cascade deletion and other operations are not available.

• Non standardization

In the design of relational database, the importance of paradigm is often emphasized. There are three paradigms in database design. But in Cassandra, following the paradigm is not a good choice, because it usually performs best when not following the paradigm.

The second reason relational databases are deliberately de normalized is the business document structure that needs to be preserved. In other words, there is a closed table, which refers to many external tables. The data of these tables may change over time, but you need to save the closed document as a snapshot in the history. The most common example here is the invoice. Suppose you already have the customer and product tables, you may think you can only make an invoice for these tables, but you should never do so in practice. Customer or price information may change, and then you will lose the integrity of the invoice date on the invoice document, which may violate auditing, reporting, or law, and lead to other problems. As you can see, redundancy is necessary in this case.

In Cassandra, denormalization is perfectly normal.

Query first

In short, relational modeling means that the conceptual domain model starts, using tables to represent domain objects, using table fields to represent domain objects’ properties, and then setting primary keys and foreign keys to represent the previous relationships of domain objects. If there is a many to many relationship, we need to build another intermediate table. Queries in the relational world are secondary. As long as the table is properly modeled, you can always get the data you need. This is usually true even if several complex subqueries or join statements have to be used.

In Cassandra, by contrast, it doesn’t start with the data model, it starts with the query model. Instead of modeling data first and then writing queries, Cassandra models queries first and organizes data around queries. Consider the most common query paths that your application will use, and then create the tables you need to support them.

• Optimal storage design

In relational database, for users, how tables are stored on disk is usually transparent, and they don’t care. However, this is an important consideration in Cassandra. Since Cassandra tables are all stored in separate files on disk, it is important to define related columns together in the same table.

When creating a data model in Cassandra, a key goal is to minimize the number of partitions that must be searched to satisfy a given query. Since partitions are storage units that are not partitioned across nodes, queries that search for a single partition usually produce the best performance.

• Sequencing is a design decision

In RDBMS, you can easily change the order of records returned by using order by in queries. The default sort order is not configurable; By default, records are returned in the order they are written. If you want to change the order, you just need to modify the query, and you can sort by any column.

In Cassandra, however, sorting is handled differently. It’s a design decision. The sort order available in the query is fixed and completely determined by the selection of cluster columns provided in the create table command. The CQL select statement does support the order by semantics, but only in the order specified by the clustered column.

four   Define queries for applications

Since it’s query driven, let’s take a look at the example of hotel reservation. All businesses need to query. After all, technology serves the business, so it’s not advisable to set aside business design.

(voice over: at this moment, I suddenly think of the sentence “technology supports business, technology expands business, technology creates business”)

In the case of hotel reservation, we can sort out the following business queries:

Q1: find hotels near a scenic spot
Q2: find information about a hotel
Q3: find a scenic spot near a hotel
Q4: find a room available within a given date range
Q5: find room prices and facilities
Q6: find booking by confirmation code
Q7: find reservations by hotel, date and customer name
Q8: find all reservations by customer name
Q9: view customer details


five   Logical data modeling

In order to show the data model more vividly, the following chart method is adopted here:


5.1.   Hotel logical data model

According to the above chart, the logical data model of the hotel is as follows:

5.2.   Reservation logical data model

In the same way, the reservation logical data model is represented as follows:

six   Physical data modeling

(voice over: all designs are for searching, which sounds familiar. Ha ha, elasticsearch said that all designs are for improving the performance of searching.)

For the convenience of understanding, it is expressed in the following format. Needless to say, look at the picture


Hotel data model:

Booking data model:

seven   Evaluate and improve data model

7.1.   Calculate partition size

The first thing to consider is whether the partition of the table is too large, or in other words, too wide. Partition size is measured by the number of cells (values) stored in the partition. Cassandra’s hard limit is 2 billion cells per partition (PS: similar to cells in Excel), but before reaching this limit, you may encounter performance problems.

Partition size formula: n_ v = N_ r (N_ c – N_{ pk} – N_ s) + N_ s

Among them:

N_ R is the number of rows
N_ C is the number of columns
N_ PK is the number of primary key columns
N_ S is the number of static columns
N_ V is the number of cells

Then, the number of cells = the number of rows × ( Total columns – primary key columns – static columns) + static columns

With available_ rooms_ by_ hotel_ Take the date table as an example. According to the formula, the total number of cells in the table = the number of rows × ( 4 – 3 -0) + 0

7.2.   Calculate disk size

The amount of disk space occupied by each data type is different, which can be roughly calculated by multiplying the disk size occupied by all columns by the number of rows

7.3.   Split large partition

A technique called bucketing is often used to divide data into medium-sized partitions. For example, you can decompose available by adding a month column (possibly expressed as an integer) to the partition key_ rooms_ by_ hotel_ Date table. The comparison with the original design is shown in the figure below. Although the month column partially repeats the date, it provides a good way to group the relevant data in the partition, and the partition will not become too large.


eight   Define database schema

Schema can be understood as a database. A schema is a database

The following is the schema defined for the hotel keyspace:

CREATE KEYSPACE hotel WITH replication =
  {‘class’: ‘SimpleStrategy’, ‘replication_factor’ : 3};

CREATE TYPE hotel.address (
  street text,
  city text,
  state_or_province text,
  postal_code text,
  country text );

CREATE TABLE hotel.hotels_by_poi (
  poi_name text,
  hotel_id text,
  name text,
  phone text,
  address frozen,
  PRIMARY KEY ((poi_name), hotel_id) )
  WITH comment = ‘Q1. Find hotels near given poi’

  id text PRIMARY KEY,
  name text,
  phone text,
  address frozen,
  pois set )
  WITH comment = ‘Q2. Find information about a hotel’;

CREATE TABLE hotel.pois_by_hotel (
  poi_name text,
  hotel_id text,
  description text,
  PRIMARY KEY ((hotel_id), poi_name) )
  WITH comment = Q3. Find pois near a hotel’;

CREATE TABLE hotel.available_rooms_by_hotel_date (
  hotel_id text,
  date date,
  room_number smallint,
  is_available boolean,
  PRIMARY KEY ((hotel_id), date, room_number) )
  WITH comment = ‘Q4. Find available rooms by hotel date’;

CREATE TABLE hotel.amenities_by_room (
  hotel_id text,
  room_number smallint,
  amenity_name text,
  description text,
  PRIMARY KEY ((hotel_id, room_number), amenity_name) )
  WITH comment = ‘Q5. Find amenities for a room’;

The schema of reservation keyspace is as follows:

CREATE KEYSPACE reservation WITH replication = {‘class’:
  ‘SimpleStrategy’, ‘replication_factor’ : 3};

CREATE TYPE reservation.address (
  street text,
  city text,
  state_or_province text,
  postal_code text,
  country text );

CREATE TABLE reservation.reservations_by_confirmation (
  confirm_number text,
  hotel_id text,
  start_date date,
  end_date date,
  room_number smallint,
  guest_id uuid,
  PRIMARY KEY (confirm_number) )
  WITH comment = ‘Q6. Find reservations by confirmation number’;

CREATE TABLE reservation.reservations_by_hotel_date (
  hotel_id text,
  start_date date,
  end_date date,
  room_number smallint,
  confirm_number text,
  guest_id uuid,
  PRIMARY KEY ((hotel_id, start_date), room_number) )
  WITH comment = ‘Q7. Find reservations by hotel and date’;

CREATE TABLE reservation.reservations_by_guest (
  guest_last_name text,
  hotel_id text,
  start_date date,
  end_date date,
  room_number smallint,
  confirm_number text,
  guest_id uuid,
  PRIMARY KEY ((guest_last_name), hotel_id) )
  WITH comment = ‘Q8. Find reservations by guest name’;

CREATE TABLE reservation.guests (
  guest_id uuid PRIMARY KEY,
  first_name text,
  last_name text,
  title text,
  emails set,
  phone_numbers list,
  addresses map,
  confirm_number text )
  WITH comment = ‘Q9. Find guest by ID’;

nine   file 

Recommended Today

Implementation example of go operation etcd

etcdIt is an open-source, distributed key value pair data storage system, which provides shared configuration, service registration and discovery. This paper mainly introduces the installation and use of etcd. Etcdetcd introduction etcdIt is an open source and highly available distributed key value storage system developed with go language, which can be used to configure sharing […]