Configuration check of Cassandra data model and schema



This document provides general data modeling and schema configuration recommendations for DataStax Enterprise (DSE) and Apache Cassandra. It assumes basic knowledge of DSE/Cassandra and does not replace the official documentation.


Data modeling is one of the main factors that determine the success of most projects seen by the DataStax customer consulting team. A system with a correct data model is scalable and has few problems; systems with incorrect data models are usually unstable and fail even with relatively small amounts of data. This is why the consulting team focuses on the data model when auditing a cluster. If you need more information about Cassandra data modeling, refer to the Cassandra or DataStax CQL data modeling documentation.




1 data model check

This section lists a set of routine checks that the customer consulting team performs when analyzing existing data models. (You can also use it for data models that are still in development.) The team obtains the existing schema from the diagnostic compressed file (tarball) produced by OpsCenter or by the diagnostic collection script. You can also produce it yourself by executing cqlsh -e 'DESCRIBE SCHEMA;' on a cluster node and redirecting the output to a file, for example schema.cql. We use that name throughout this article.


In a diagnostic tarball, the schema is located in the driver/schema file of each node. In addition to the schema information, also use the nodetool commands, executed on the nodes of the cluster (or their output extracted from the diagnostic tarball), because in some cases only some nodes are affected.


When analyzing the data model, consider the hard constraints in Cassandra and DSE (version dependent) associated with CQL, as well as the recommendations in this article.




2 keyspace replication settings

Check all keyspaces to make sure they have the correct replication settings. Note the following:


2.1 wrong replication strategy

When your cluster has multiple data centers, use NetworkTopologyStrategy instead of SimpleStrategy. If SimpleStrategy is used, replicas are not guaranteed to be placed in the correct data centers.


Tip: even if you have only one data center, it is better to use NetworkTopologyStrategy, because if you decide to add a data center in the future, such a setting will simplify the transition.
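For example, a keyspace created with NetworkTopologyStrategy even though only one data center exists (the keyspace name and the data center name "DC1" are hypothetical; the data center name must match what nodetool status reports):

```sql
-- Hypothetical keyspace; 'DC1' must match the data center name
-- reported by nodetool status.
CREATE KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};
```

Adding a second data center later then only requires an ALTER KEYSPACE that adds its entry to the replication map.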


2.2 keyspace is under-replicated or not replicated to all data centers

This is especially important for system keyspaces (for example, system_auth). If you lose the data copies of system_auth, you or your application may lose the ability to log in to the cluster.


2.3 over replication of keyspace

Sometimes in a large cluster, some keyspaces have a replication factor much higher than the usual setting of 3; in some cases a significantly higher number is used, such as 5. Higher values usually increase the latency of read and write operations, especially when using consistency levels such as QUORUM or LOCAL_QUORUM. If you want to further protect data and ensure cluster availability, consider adding a new data center, backups, and so on.


2.4 even numbers for replication factor (RF)

Usually, with consistency levels such as QUORUM or LOCAL_QUORUM, an even number of replicas does not work well, because it makes the cluster less tolerant of failures. A quorum is calculated as floor(N/2) + 1, where N is the number of replicas in the cluster. LOCAL_QUORUM uses the same formula, but N is the number of replicas in a particular data center.


For example, with RF = 2 the quorum is 2, so when one node fails, operations at QUORUM fail. If you increase RF to 3, the quorum is still 2, which means that if one node fails, such operations are not affected, as shown in the following table:


Replication factor    Nodes that can be down while QUORUM (or LOCAL_QUORUM
                      within a data center) can still be achieved
2                     0
3                     1
4                     1
5                     2
6                     2
7                     3
To fix replication problems, you can execute the ALTER KEYSPACE command manually, or automate these operations with a script. System keyspaces that use LocalStrategy or EverywhereStrategy must be left unchanged.
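As a sketch, correcting a hypothetical keyspace that was created with SimpleStrategy might look like this (keyspace and data center names are assumptions):

```sql
-- Hypothetical keyspace and data center names.
ALTER KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};
```

After changing replication settings, run a full repair (nodetool repair -full) so that existing data is streamed to the newly responsible replicas.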




3 number of tables

The number of tables in Cassandra directly affects cluster performance. In general, it is recommended to have no more than 200 actively used tables in a cluster. A cluster with 500 active tables is considered to be at the failure level even if it still appears to work, because it is likely to be inefficient and failure-prone.


The problem is that each table requires about 1 MB of memory for metadata. For each table in use, the system allocates a memtable. Tables with large amounts of data also store more data in Bloom filters and other auxiliary data structures, which further increases the pressure on memory. Moreover, each keyspace imposes an additional burden on JVM memory. All of this together affects Cassandra's performance. Benchmarks show that an increase in the number of tables leads to a significant drop in throughput:

To check how many tables and keyspaces are in the cluster:

$ grep 'CREATE TABLE' schema.cql | wc -l
$ grep 'CREATE KEYSPACE' schema.cql | wc -l





4 check the structure of tables

The following checks of table definitions should be performed, because these aspects may affect the operational performance of the cluster.


4.1 check the structure of primary key and partition key

The structure of primary keys (especially partition keys) can have a significant impact on the performance and stability of a cluster. When analyzing the structure of a table, consider the following factors:

  • When the primary key consists of only a partition key, the rows may be too small. Because the metadata associated with a partition may be larger than the row itself, this can lead to inefficiency when accessing or storing data.

  • When the partition key is composed of a single column, check its data type. Some data types have low cardinality by definition, such as boolean or tinyint, which may lead to uneven data distribution between nodes. For example, if you define a partition key column with a boolean type, there will be only two partitions in the table. You may also get large partitions when there are many rows in a partition.


Using date/time types as the partition key column can cause another potential problem. In many cases, if people use a date type as the partition key and organize writes by day, the application writes and reads a large amount of data (hundreds and thousands of requests per second) in the single partition for the current day, which easily leads to hot spots.
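One common way to avoid such hot spots is to add an artificial bucket component to the partition key so that one day's data is spread over several partitions. A sketch with hypothetical table and column names:

```sql
CREATE TABLE sensor_data (
  day date,
  bucket int,          -- e.g. computed by the application as hash(sensor_id) % 16
  sensor_id uuid,
  ts timestamp,
  value double,
  PRIMARY KEY ((day, bucket), sensor_id, ts)
);
```

Reads for a whole day must then query all buckets, so choose the bucket count as a trade-off between write distribution and read fan-out.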


4.2 check the number of columns in the table

We do not recommend defining hundreds or thousands of columns for a single table because:

  • It is easy to exceed the commonly recommended maximum number of cells per partition (too many columns per row). See "Number of cells per partition" below.

  • Each stored cell carries a timestamp of at least 8 bytes; if a TTL is set, even more overhead is added.

  • It may affect the performance of range scans.


If there are too many columns in a table, analyze the data access patterns first. Possible solutions include:

  • If several columns are frequently read together, combine them into a frozen user-defined type (UDT), where all data in the UDT is written as a single unit.

  • Perform serialization and deserialization of the data inside the application.

  • Store the data as a blob.
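As an illustration of the first option, columns that are always read together can be grouped into a frozen UDT (all names here are hypothetical):

```sql
CREATE TYPE address (
  street text,
  city text,
  zip text
);

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  home_address frozen<address>   -- stored and rewritten as a single cell
);
```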


4.3 check that appropriate data types are used

Cassandra provides a rich set of data types for table columns. Because there are so many, users often choose an incorrect one: for example, a text type used to store timestamps, or an oversized numeric type (whose range of values is much larger than required, for example bigint for a column where int would be sufficient). Such improper use can lead to the following problems:

  • Unnecessary use of disk space. For example, a timestamp stored as an ISO-8601 encoded text value takes 28 bytes, while the timestamp type uses only 8 bytes. Similarly for numeric types: bigint takes 8 bytes, while int uses only 4. The situation is even worse with the decimal and varint types, because they are not fixed in size and depend on the actual value.

  • If you use DSE search, you may not be able to search the data correctly. For example, if you use a text data type to store numbers or timestamps, you may not be able to perform range queries.

  • When the data is not encoded correctly, you may not be able to sort it correctly.


4.4 checking the use of collection types

Cassandra provides several data types to store multiple values inside a single column: list, set, and map. Each requires defining the type of the elements when the table is created. Collection types come in two flavors:


frozen: the entire content of the collection is serialized and stored as one value. This flavor avoids some of the problems described below, but does not allow updating individual elements of the collection.


non-frozen: the collection is stored as a set of individual elements in separate cells.


Collection types are convenient for development. Consider the following factors when using them:

  • Additional overhead to store the metadata of individual elements when using non-frozen collections. This includes the write timestamp and the optional TTL. For the list type, the element index uses a UUID (16 bytes per element), which requires additional storage.

  • When an insert or a full update of a non-frozen collection occurs, such as when one value replaces the whole value of the column (for example, UPDATE table SET field = new_value ...), Cassandra inserts a tombstone marker to prevent overlap with previous data, even if no data existed before. A large number of tombstones can seriously affect read performance.

  • There is an upper limit on the number of elements in a collection. For Cassandra 3.0.1/3.1 and later: 2 billion; for earlier versions: 65,535. Too many elements can exceed the maximum mutation size limit when accessing data in a non-frozen collection or when using a frozen collection, resulting in performance problems. In addition, when reading a column with a collection type, its entire content is returned, and transferring such a large amount of data may hurt performance.

  • For non-frozen collections, after individual elements are inserted and updated, the data may be spread across multiple SSTables, and the actual column value must be rebuilt on read, which may degrade performance.

  • Since read repair does not propagate tombstones, the contents of collections with deleted elements may be affected. This happens because custom tombstones as delete marks are not propagated.


Following these rules can alleviate the problems listed above:

  • Use frozen collections, unless it is necessary to update individual elements.

  • Keep the number of elements in all collection types on the order of tens, with a maximum of a few hundred. The content of a collection column is read as a whole, so too many elements cause read problems, because the maximum possible page size is 256 MB.


Note: when a query returns many rows, it is inefficient to return them as a single response message. Instead, the driver divides the results into pages that are returned as needed. The application can control how many rows are included in a single page, but the maximum page size is defined by the native protocol.

  • To prevent the creation of tombstones, when you insert data into a collection or map (or perform a full update of a set or map) and you know that no data existed before, you can use the append operation on the column. For example:



CREATE TABLE test.m1 (
    id int PRIMARY KEY,
    m map<int, text>
);



Instead of using:

INSERT INTO test.m1(id, m) VALUES (1, {1:'t1', 2:'t2'});


UPDATE test.m1 SET m = {1:'t1', 2:'t2'} WHERE id = 1;



both of which generate a tombstone, execute:

UPDATE test.m1 SET m = m + {1:'t1', 2:'t2'} WHERE id = 1;


In this case, the result is the same, but no tombstone is generated.


If only one column in the table has a collection type, you can model it as an additional clustering column instead. For example:



CREATE TABLE test.m1 (
    id int PRIMARY KEY,
    m map<int, text>
);



You can create this table without the map column (the same approach works for sets and lists):


CREATE TABLE test.m1 (
    id int,
    m_key int,
    m_value text,
    PRIMARY KEY(id, m_key)
);



You can select all values of a particular partition by omitting m_key from the query, or select a specific element by providing the full primary key. This is an advantage over a collection-type column, which is always returned as a whole.
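With that table definition, both access patterns are ordinary CQL queries (the key values here are illustrative):

```sql
-- All key/value pairs of the partition (the whole "map"):
SELECT m_key, m_value FROM test.m1 WHERE id = 1;

-- A single element, addressed by the full primary key:
SELECT m_value FROM test.m1 WHERE id = 1 AND m_key = 2;
```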


4.5 check the use of the list type

Everything described in the previous section also applies to the list type. However, the list type has additional limitations:

  • Setting and deleting elements by position, as well as deleting occurrences of particular values, cause an internal read-before-write.

  • Prepend and append operations are not idempotent.


In case of failure, you cannot simply retry the operation, because you do not know whether it completed. Retrying can cause duplicate elements; not retrying may lose data. For more information, see the list fields documentation.


If you do not need to keep elements in a specific order or to allow duplicate values, use the set type instead of the list type. If you still need a list-type column, consider using its frozen version.
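For example, if tags attached to a record need neither ordering nor duplicates, a set is the safer choice (table and key values here are hypothetical):

```sql
CREATE TABLE user_tags (
  id uuid PRIMARY KEY,
  tags set<text>
);

-- Adding an element to a set is idempotent, so it can be retried safely:
UPDATE user_tags SET tags = tags + {'cassandra'}
  WHERE id = 123e4567-e89b-12d3-a456-426614174000;
```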


4.6 checking the use of user-defined types

Cassandra allows the creation of user-defined types (UDTs). You can use a UDT to group related information together and treat each combination as a single entity. From the perspective of data model analysis, the same rules as for collections apply:

  • Use frozen UDTs whenever possible.

  • For non frozen UDTs, do not specify too many fields.


However, UDTs have problems related to serialization and deserialization. Another problem, from the schema evolution perspective, is that although you can add fields to a UDT, you cannot remove them. This means UDTs should be used only in the limited cases where they are really necessary; otherwise, it is better to define the data as regular columns of the table. Another option is to serialize and deserialize UDT data inside the application and store it as a blob.


4.7 check the use of deeply nested UDTs and UDT sets

Although UDTs can be nested inside other UDTs or used as elements of collections, you must be very careful: if a collection has too many elements or the UDTs are nested too deeply, the maximum mutation size limit can be reached and operations fail.


4.8 checking the use of tuple types

CQL provides tuple data types, which can group multiple elements of different data types into an entity. The limitations of this type are:

  • Its value is always frozen, which means the whole column is rewritten on every update.

  • Elements can only be accessed by position, which makes code harder to develop, because you need to remember which type is used at which position and what that position means.


Because of these limitations, DataStax recommends not using this data type and using a UDT instead.


4.9 check the use of counter data types

The counter data type allows increment and decrement operations, which is useful for some applications. Since Cassandra 2.1, the counter implementation is more robust, but limitations remain:

  • The counter value may not be accurate when a node fails, a write is lost, or similar events occur, because counter operations are not idempotent and cannot be safely retried: retrying may cause over-counting; not retrying may cause under-counting.

  • A table containing counters may have only counter types in its regular columns; it is not possible to mix them with columns of other data types.
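A minimal sketch of a counter table (hypothetical names); note that counters are modified with UPDATE, not INSERT:

```sql
CREATE TABLE page_views (
  page_id uuid PRIMARY KEY,
  views counter            -- all regular columns must be counters
);

UPDATE page_views SET views = views + 1
  WHERE page_id = 123e4567-e89b-12d3-a456-426614174000;
```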


4.10 checking the use of BLOB data types

Cassandra supports storing binary data in the database via the blob type. When using blobs, make sure you do not store objects larger than a few hundred kilobytes in Cassandra, otherwise problems may occur when fetching data from the database. For example, a query may fail when the size of the retrieved page is larger than the limit set by the native protocol (256 MB).


4.11 defining the sort order of clustering columns

When you define a table, you can define the sort direction of the clustering columns. An application can reverse the defined sort direction when executing a query, but that is less efficient than reading data in the sort direction defined at the table level. DataStax recommends defining the correct sort direction when creating the table. Also note that if the sort is reversed at query time, it affects all clustering columns, not just one, causing Cassandra to read data in the opposite direction.
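For example, a time-series table whose newest rows are read most often can declare the order at creation time (names are hypothetical):

```sql
CREATE TABLE events (
  sensor_id uuid,
  ts timestamp,
  value double,
  PRIMARY KEY (sensor_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);  -- newest events first within a partition
```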




5 number of cells per partition

Cassandra documents often use the term “cell” to describe the stored values of regular columns (non primary key columns). In addition to the actual values, each cell also has associated metadata, such as time stamps, optional TTL, and other data for complex cells. Collections and user-defined types are more complex.


Cassandra has a hard limit of 2 billion cells per partition. To keep read operations predictable, DataStax recommends limiting the number of cells in a partition so that the partition stays smaller than 100 MB.


You can use the nodetool tablehistograms command (cfhistograms in old Cassandra) to check the number of cells in each partition. The latest version of Cassandra and DSE can output the data of all tables in the system, while the older version needs to give the specific keyspace and table name.


Look at the Cell Count column of the output and check the values in the 99% percentile and Max rows. If the values are greater than 100,000, consider changing the data model; this may be a sign of large partitions (described in the next section), too many columns, or too many elements in non-frozen collections.


$ nodetool tablehistograms test widerows


test/widerows histograms
Percentile     SSTables     Write Latency     Read Latency     Partition Size     Cell Count
                            (micros)          (micros)         (bytes)
50%           1.00          0.00              1310.72          545791             103
75%           1.00          0.00              1310.72          545791             124
95%           1.00          0.00              1310.72          545791             124
98%           1.00          0.00              1310.72          545791             124
99%           1.00          0.00              1310.72          545791             124
Min           1.00          0.00              1310.72          454827             87
Max           1.00          0.00              1572.86          545791             124





6 large partitions

For Cassandra, we recommend keeping the partition size below 100MB. The existence of large partitions indicates that there are errors in the data model, which are usually triggered by the following factors:

  • The cardinality of the partition key is low: there are too few possible values for the partition key.

  • The data is not evenly distributed between partitions.

For example, if you use the customer ID as the partition key, the application for the large customer will write more data than the small customer. As a result, some nodes may have more data than others. More data will increase the load on these nodes because they need to process more requests, need more compaction operations, etc.

  • There are too many columns and rows in a table, especially when each row contains data for all or most of the columns.

  • Store large blobs or long texts in the table.

  • Large partitions impose an additional burden on Cassandra, such as allocating extra memory to hold the partition index. Note: before Cassandra 3.6, reading large partitions put much more pressure on the Java heap and often led to node crashes.

  • Uneven data distribution between nodes leads to hot spots, with some nodes handling more requests than others.

  • When reading an entire partition, more data needs to be transferred for large partitions.

  • The size of Cassandra partitions affects external systems such as Spark, because a Cassandra partition is the smallest unit mapped to a Spark partition. Any imbalance in Cassandra can lead to an imbalance in Spark's data processing.


6.1 finding information about partitions

Use the following tools to find the size of the partition:

  • Use the nodetool tablehistograms command (cfhistograms in older Cassandra) to find the partition sizes at the 99% percentile and the maximum. If you see a big difference between these values, the partition key values are probably not evenly distributed. Similar information can be obtained with the sstablepartitions command.

  • Information about the maximum partition size is available from nodetool tablestats (cfstats in older Cassandra). Check whether the value in the "Compacted partition maximum bytes" row is greater than the recommended 100 MB.

  • If the partition key values are unevenly distributed, the DataStax Bulk Loader (dsbulk) can be used to find the partition key values with the largest number of rows. The main advantage of dsbulk is that it works with the entire cluster. For example, to find the largest partitions in a test table:

$ dsbulk count -k test -t widerows --log.verbosity 0 --stats.mode partitions

'29' 106 51.71

'96' 99 48.29
  • You can use the sstablemetadata utility with the -s command-line flag to find the largest partitions in particular SSTables. The -s flag is available in Cassandra 4.0 and DSE 6.x; for Cassandra 3.x, use the sstable-tools project (which inspired the sstablemetadata improvements). An advantage of sstablemetadata is that it reports the largest partitions both by number of rows and by size in bytes. The disadvantage is that it works on a single SSTable file, while a partition can be split across several SSTables. For example:

$ sstablemetadata -u -s path_to_file/mc-1-big-Data.db

SSTable: /Users/ott/var/dse-5.1/data/cassandra/data/datastax/vehicle-8311d7d14eb211e8b416e79c15bfa204/mc-1-big

Size: 58378400

Partitions: 800

Tombstones: 0

Cells: 2982161


Widest Partitions:
  [266:20180425] 534

  [202:20180425] 534

  [232:20180425] 534

  [187:20180425] 534

  [228:20180425] 534


Largest Partitions:
  [266:20180425] 134568 (131.4 kB)

  [202:20180425] 134568 (131.4 kB)

  [232:20180425] 134568 (131.4 kB)

  [187:20180425] 134568 (131.4 kB)

  [228:20180425] 134568 (131.4 kB)



You can check for low cardinality of partition key values by looking at the estimated number of partitions in the tablestats/cfstats output, or by executing SELECT DISTINCT <partition key columns> FROM table and counting the rows returned.


The only solution to the problems described in this section is to change the data model to select correct partition and clustering keys. In some cases, you may be able to promote a clustering column into the partition key, or introduce an artificial grouping column into the partition key.
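Because the primary key of an existing table cannot be altered, such a change means creating a new table and migrating the data. A sketch with hypothetical names, moving an artificial grouping column into the partition key:

```sql
-- Before: PRIMARY KEY (customer_id, order_id) -- one partition per customer.
-- After: each customer's orders are split by month:
CREATE TABLE orders_by_customer_month (
  customer_id uuid,
  order_month date,      -- artificial grouping column
  order_id timeuuid,
  total decimal,
  PRIMARY KEY ((customer_id, order_month), order_id)
);
```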




7 check compaction strategy

In general, the default SizeTieredCompactionStrategy (STCS) is preferred, unless it causes problems or another strategy has obvious advantages. For more information on tuning compaction strategies, see the separate documentation.




8 check the use of the secondary index

Cassandra and DSE provide several ways to search in a table using columns that are not part of the partition key:

  • Secondary index

  • Materialized view

  • SASI index

  • DSE search index – note: DSE 6.8 includes a beta version of the storage attached index (SAI).

These techniques spare users from denormalizing data into additional tables. However, each of the above implementations has its own limitations.


8.1 check the use of secondary index

Cassandra's native secondary index is a reverse index. It indexes the data on each node by building an internal table that maps each specific column value to a full primary key. The purpose of this index is to bypass the restriction that data can otherwise only be accessed through the primary key structure defined on the base table.


Secondary indexes have some limitations:

  • Only a single regular column can be indexed per index.

  • Non equality or range conditions are not supported.

  • They can be seriously affected by the cardinality of the indexed column. With low cardinality, very wide index partitions are created. With high cardinality, you may create many very small partitions. Both can adversely affect performance.

  • Deletion is not handled well. A large number of tombstones in the secondary index will seriously degrade its performance.

  • Because secondary indexes index the base table's data locally on each node, queries cannot be routed normally by partition key. A query without a partition key restriction results in a scatter-gather request to all nodes in the data center, giving poor performance.

For these reasons, you must be very careful when using secondary indexes and, where possible, avoid them by denormalizing instead. A secondary index may be a reasonable choice for filtering results when all of the following hold: the data is searched within a single partition, partitions are small, the table sees few deletes, and the query restricts the partition key so that only the node holding the base partition is queried. Even when these conditions are met, we strongly recommend thoroughly testing queries that use secondary indexes with representative data and load.
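A sketch of this pattern, with hypothetical table and column names; the query restricts the partition key (customer_id), so only the replicas of that partition are contacted:

```sql
CREATE INDEX orders_status_idx ON orders (status);

SELECT * FROM orders
  WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000
    AND status = 'OPEN';
```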


Because Cassandra consumes resources to build and maintain secondary indexes in a consistent state, DataStax recommends keeping the number of secondary indexes relatively low and deleting all unused ones. You can check the number of defined secondary indexes as follows:

$ grep 'CREATE INDEX' schema.cql|wc -l 



8.2 check the use of materialized views

Cassandra 3.0 and DSE 5.0 introduced support for materialized views to make it easier for client applications to denormalize data automatically and transparently. A materialized view is a view defined at the schema level on a base table. The view contains the same information as the base table (or a subset of it) but with a different primary key, so read patterns that the original key structure cannot serve become possible through the view.


When data is written to a base table, all materialized views are automatically updated accordingly so that they can be read at any time according to their keys, just like regular tables. Note that the data is actually stored in each view, so the total footprint increases based on the number of views and the information they contain.


When using materialized views on a table, consider the following factors:

  • Constraints on the primary key structure of Materialized Views:

    • The key of a materialized view must contain all the columns that make up the base table's key, because the definition of row uniqueness must be the same.

    • The key of a materialized view can contain at most one regular column in the base table, provided that the column can never be null.

  • Using materialized views on tables places an additional burden on the database, so plan resources accordingly.

  • In order to construct rows in materialized views, Cassandra needs to read the corresponding rows from the base table, which adds extra burden and delay to the IO system.

  • If the materialized view has different partition keys, the data insertion needs to communicate with other nodes responsible for the corresponding token range.

  • In some cases, materialized views may get out of sync with the base table. If this happens, use nodetool rebuild_view (regular repair does not work for materialized views).

  • It may take some time to create a materialized view on a table with existing data, depending on the amount of data. You can check the status of the build operation with the nodetool viewbuildstatus command.

  • In Cassandra, materialized views are still marked as experimental and are not recommended for production environments.
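A minimal sketch of a materialized view (hypothetical names) that satisfies the key constraints above: the view key contains the base table's key column (id) plus at most one regular column (email), and filters out null key values:

```sql
CREATE TABLE users (
  id uuid PRIMARY KEY,
  email text,
  name text
);

CREATE MATERIALIZED VIEW users_by_email AS
  SELECT * FROM users
  WHERE email IS NOT NULL AND id IS NOT NULL
  PRIMARY KEY (email, id);
```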


Although materialized views are convenient from a development perspective, you can get better performance by creating one or more auxiliary tables and writing to all of them from the application. Although this increases the complexity of the application code, it has benefits, such as greater flexibility in defining the primary keys of the auxiliary tables, and avoiding the read from disk that Cassandra performs before writing an entry to a materialized view.

If you still need to use materialized views, keep the number low. Use the following command to check the number of Materialized Views:

$ grep 'CREATE MATERIALIZED VIEW' schema.cql|wc -l 


8.3 checking the use of Sasi index

SASI (SSTable-Attached Secondary Index) is an alternative implementation of secondary indexes, intended to allow more flexible query conditions with better performance. SASI was contributed to Apache Cassandra by an external contributor; its initial implementation was developed for a very specific use case, on an old version of Cassandra and with an obsolete API.


Moreover, it was tested only to a very limited extent. Further testing and preliminary experiments showed that SASI indexes are affected by multiple bugs. Despite fixes and improvements, SASI is still unreliable and may return inconsistent results, so we do not consider it production-ready.


DataStax recommends avoiding SASI indexes for any queries on production systems.


You can check the usage of SASI indexes with the following command:

$ grep -e 'CREATE CUSTOM INDEX.*SASIIndex' schema.cql|wc -l


8.4 checking the use of DSE search index

DSE has its own search index implementation based on Apache Solr, called DSE Search. It integrates transparently with core Cassandra and allows indexing of stored data. It supports different types of searches on any column or combination of columns in a table, such as full-text search, range search, and exact search.


Although it is very flexible, the following points need to be considered:

Note: Apache Lucene and Solr as well as DSE search have some limitations. The storage attached index (SAI) in DSE 6.8 improves many of these limitations.

  • To build objects in the DSE search index, DSE needs to read the corresponding rows from the base table, which increases IO load.

  • The number of tables with DSE search indexes. The recommended maximum number of indexes depends on the DSE version and the hardware; see the capacity planning documentation for DSE Search.

  • The use and number of virtual nodes. Because DSE Search performs scatter-gather queries across all token ranges, the number of queries sent is proportional to the number of token ranges. Keep the number of vnodes small, for example 8.

  • The size of the search indexes. It is recommended to keep a single index under 250 GB, and the size of all search indexes combined under 500 GB.

  • Which columns, and of which types, are indexed. The size of a DSE search index may be significantly larger than the data size in Cassandra, depending on the types of the indexed columns and the index type. To keep the index size under control, index only the columns that are needed. Different index types also affect performance: for example, indexing a column as tokenized full text is more expensive than indexing it as a single string when full-text search is not required.

  • Some data types, such as counters and freeze maps, are not supported.

  • Static columns cannot be indexed.

  • The number of objects (documents) within a single search index on a single node is limited to about 2 billion documents. This limit can be reached quickly when the indexed table uses columns with user-defined types, because those columns are indexed as separate documents.

  • DSE Search executes queries at consistency level ONE. This means that the returned results may differ if the data has not been repaired.

  • Row level access control is not supported.


You can check the usage of the DSE search index using the following command:

$ grep -e 'CREATE CUSTOM INDEX.*Cql3SolrSecondaryIndex' schema.cql|wc -l


Use the DESCRIBE ACTIVE SEARCH INDEX command to review the schema and configuration of each index.