Building a similarity calculation and fast deduplication system for 100-billion-scale text data on HBase



With the arrival of the big data era, data makes our lives more convenient, but it also brings a series of tests and challenges. This article introduces the design and implementation of a system, built on Apache HBase, Google's simhash, and other algorithms, that supports similarity calculation and fast deduplication for text data at the ten-billion scale. The solution solved the problem of slow storage and computation of massive multi-topic text data at the company's business level.

1、 Problems faced

1. How do we choose a text similarity or deduplication algorithm?

Common algorithms include cosine similarity, Euclidean distance, Jaccard similarity, longest common substring, edit distance, and so on. These algorithms work well when there is not much text to compare. With massive data, however, where tens of millions of records are generated every day, how can we efficiently merge, deduplicate, and compute similarity over all of them?

2. How do we compute text similarity or deduplicate quickly?

Once the similarity and deduplication algorithm is chosen, how do we actually apply it? If there is little text to compare, we can simply traverse all of it. For a huge data set, a full traversal is clearly not feasible.

3. Massive data storage and fast reading and writing

2、 Introduction of simhash algorithm

Based on problem 1, we introduce the simhash algorithm for similarity calculation and fast deduplication of massive text. Let's take a brief look at the algorithm.

1. Locality-sensitive hashing

Before introducing simhash, let's briefly explain locality-sensitive hashing (LSH). Its basic idea resembles a spatial domain transformation. LSH rests on the assumption that if two texts are similar in the original data space, they remain highly similar after the hash transformation; conversely, if they are not similar, they should still be dissimilar after the transformation.

The most important property of locality-sensitive hashing is that it preserves similarity. A small example: slightly edit article A and call the result article B (it may differ by just one extra "de"). The MD5 values of the two articles will be completely different. With a locality-sensitive hash, the hash value changes only slightly; that is, if two articles are highly similar, their hashes are highly similar too.

Minhash and simhash are both locality-sensitive hashes. In general, minhash works better when features carry no weights, and simhash is suitable when they do. Simhash works very well on long text, but its accuracy on short text is not high.

2. Simhash algorithm

Simhash is a fingerprint generation (fingerprint extraction) algorithm described in the paper "Detecting Near-Duplicates for Web Crawling", published by Google in 2007. Google uses it widely to deduplicate hundreds of millions of web pages. The main idea of simhash is dimensionality reduction: after reduction, each document is represented by just a 32-bit or 64-bit binary string of 0s and 1s, and a lookup on such a one-dimensional value is very fast.

The working principle of simhash is omitted here. You can simply think of it this way: for every web page or article, simhash generates a 32-bit or 64-bit binary string of 0s and 1s (a vector fingerprint), such as: 10000100101011111100000101010001111100001001011011011.
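For readers who want something concrete, here is a minimal sketch of how such a fingerprint can be produced. It is not the implementation used in this system: splitting on whitespace, giving every token a weight of 1, and using FNV-1a as the token hash are all assumptions made for illustration, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Minimal 64-bit simhash sketch: whitespace tokens, weight 1 each, FNV-1a as the token hash. */
final class SimHashSketch {

    /** 64-bit FNV-1a hash of one token (an illustrative choice of hash function). */
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    /** Every token votes +1 or -1 on each bit; a positive sum sets the fingerprint bit. */
    static long simhash(String text) {
        int[] votes = new int[64];
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= (1L << bit);
            }
        }
        return fingerprint;
    }

    /** Renders the fingerprint as a 64-character string of 0s and 1s. */
    static String toBinaryString(long fingerprint) {
        return String.format("%64s", Long.toBinaryString(fingerprint)).replace(' ', '0');
    }

    public static void main(String[] args) {
        long a = simhash("hbase stores the simhash fingerprints of every article");
        long b = simhash("hbase stores the simhash fingerprints of each article");
        System.out.println(toBinaryString(a));
        System.out.println(toBinaryString(b));
        System.out.println("differing bits: " + Long.bitCount(a ^ b));
    }
}
```

A real system would normally segment the text properly and weight tokens (for example by TF-IDF) instead of giving each one a single vote.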

3. Hamming distance

The Hamming distance between two codewords is the number of bit positions at which their values differ. In a code (a set of valid codewords), the minimum Hamming distance between any two codewords is called the Hamming distance of the code. For example, comparing 10101 and 00110 from the first bit, the first, fourth, and fifth bits differ, so the Hamming distance is 3.

According to the data in Google's paper, two documents can be considered similar or duplicate if their 64-bit signatures are within a Hamming distance of 3. Of course, this value is only a reference.
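On 64-bit fingerprints held in a long, the Hamming distance is a single XOR followed by a bit count. The sketch below shows this with the 10101 / 00110 example; the class and method names are hypothetical, and the threshold is passed in as a parameter because 3 is only a reference value.

```java
/** Hamming distance between two fingerprints stored as 64-bit longs. */
final class HammingDistance {

    /** XOR leaves a 1 at every position where the inputs differ; bitCount counts those 1s. */
    static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /** Two fingerprints count as near-duplicates when they differ in at most threshold bits. */
    static boolean isNearDuplicate(long a, long b, int threshold) {
        return distance(a, b) <= threshold;
    }

    public static void main(String[] args) {
        long first = 0b10101L;
        long second = 0b00110L;
        System.out.println("distance = " + distance(first, second));
        System.out.println("near duplicate at threshold 3: " + isNearDuplicate(first, second, 3));
    }
}
```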

In this way, simhash turns 10 billion articles with high-dimensional features into one-dimensional strings, and we judge the similarity of web pages or articles by computing the Hamming distance between them, which greatly improves efficiency.

3、 Efficiency issues

At this point the similarity problem is basically solved, but at a scale of tens of billions of records the efficiency problem is not. Data keeps arriving, and we cannot compare every new record against the entire database. With that approach, processing time grows linearly with the data and gets slower and slower.

Here we introduce a new concept: the drawer principle, also called the pigeonhole principle. A simple example:

There are four apples on the table but only three drawers. If all four apples go into the three drawers, some drawer must contain at least two apples. If each drawer represents a set and each apple an element, then putting n + 1 elements into n sets means at least one set contains at least two elements.

The drawer principle is that simple. How can we use it to solve our problem of traversing massive data?

To deduplicate massive data efficiently, we can split the 64-bit fingerprint into four 16-bit blocks. With a Hamming distance of at most 3, the differing bits can fall into at most three of the four blocks, so by the drawer principle, if two documents are similar, at least one block must be identical in both fingerprints.
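The argument can be checked with a few lines of code. This is only a sketch with hypothetical names: it splits a 64-bit fingerprint into four 16-bit blocks and counts how many blocks two fingerprints share at the same position. Flipping one bit in each of three different blocks is the worst case, and one block still survives.

```java
/** Splits 64-bit fingerprints into four 16-bit blocks and compares them block by block. */
final class FingerprintBlocks {

    /** Returns the four 16-bit blocks, most significant block first. */
    static int[] split(long fingerprint) {
        int[] blocks = new int[4];
        for (int i = 0; i < 4; i++) {
            blocks[i] = (int) ((fingerprint >>> (48 - 16 * i)) & 0xFFFFL);
        }
        return blocks;
    }

    /** Counts the block positions at which both fingerprints carry the same 16 bits. */
    static int sharedBlocks(long a, long b) {
        int[] x = split(a);
        int[] y = split(b);
        int shared = 0;
        for (int i = 0; i < 4; i++) {
            if (x[i] == y[i]) {
                shared++;
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        long a = 0x123456789ABCDEF0L;
        // Worst case for distance 3: one flipped bit in each of three different blocks.
        long b = a ^ (1L << 60) ^ (1L << 40) ^ (1L << 20);
        System.out.println("hamming distance = " + Long.bitCount(a ^ b));
        System.out.println("shared blocks    = " + sharedBlocks(a, b));
    }
}
```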

That is, we take each 16-bit truncated fingerprint of a text's simhash as a key, and as the value we store the set of simhashes that contain that 16-bit block, in a K-V database. At query time, we look up exactly the four simhash sets that correspond to the four 16-bit truncated fingerprints of the query fingerprint.
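The key-to-set idea above can be sketched in memory before bringing in any database. In the sketch below a HashMap stands in for the K-V store, and the class name, method names, and the final Hamming filter step are assumptions made for illustration rather than the production code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** In-memory stand-in for the K-V store: 16-bit block -> set of 64-bit fingerprints. */
final class BlockIndex {

    private final Map<Integer, Set<Long>> index = new HashMap<>();

    /** Returns the four 16-bit blocks of a fingerprint, most significant block first. */
    static int[] blocks(long fingerprint) {
        int[] result = new int[4];
        for (int i = 0; i < 4; i++) {
            result[i] = (int) ((fingerprint >>> (48 - 16 * i)) & 0xFFFFL);
        }
        return result;
    }

    /** Stores the fingerprint under each of its four block keys. */
    void add(long fingerprint) {
        for (int block : blocks(fingerprint)) {
            index.computeIfAbsent(block, k -> new HashSet<>()).add(fingerprint);
        }
    }

    /** Fetches the four candidate sets for the query, then keeps only true near-duplicates. */
    List<Long> findNear(long query, int maxDistance) {
        Set<Long> candidates = new HashSet<>();
        for (int block : blocks(query)) {
            candidates.addAll(index.getOrDefault(block, Collections.emptySet()));
        }
        List<Long> hits = new ArrayList<>();
        for (long candidate : candidates) {
            if (Long.bitCount(candidate ^ query) <= maxDistance) {
                hits.add(candidate);
            }
        }
        Collections.sort(hits);
        return hits;
    }

    public static void main(String[] args) {
        BlockIndex idx = new BlockIndex();
        long stored = 0x123456789ABCDEF0L;
        idx.add(stored);
        idx.add(~stored);
        System.out.println(idx.findNear(stored ^ 0b111L, 3));
    }
}
```

Only the fingerprints in the four fetched sets are ever compared with the query, which is the whole point of the scheme.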

Suppose the sample library holds 2 ^ 37 records (about 137.5 billion) and the data is evenly distributed. A 16-bit block has 2 ^ 16 possible values, so the inverted index returns at most
(2 ^ 37) * 4 / (2 ^ 16) = 8388608 candidates for each 16-bit truncated index. A query looks up four such indexes, so the total is 4 * 8388608 = 33554432, about 33.56 million.
Previously we needed about 137.5 billion comparisons; now about 33.56 million are enough to get the result, which greatly improves computational efficiency.

According to test data available online, an ordinary PC needs about 300 ms to compute 10 million Hamming distances, so 33.56 million comparisons (for 137.5 billion records) take only 3356 / 1000 * 0.3 = 1.0068 s. In other words, for similarity calculation and deduplication over text data at the 100-billion scale (about 100 TB if each text is 1 KB), we need about one second to get a result.
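The estimate is easy to reproduce. The sketch below only restates the arithmetic from the text; the method names are hypothetical, and the figure of 0.3 s per 10 million comparisons is the rough number quoted above, not a measurement of mine.

```java
/** Reproduces the candidate-count and timing estimate under an even-distribution assumption. */
final class CandidateEstimate {

    /** Entries stored under one block key: every document is indexed under each of its blocks. */
    static long candidatesPerKey(long documents, int blocksPerDocument, int blockBits) {
        return documents * blocksPerDocument / (1L << blockBits);
    }

    /** A query looks up one key per block, so it touches blocksPerDocument candidate sets. */
    static long candidatesPerQuery(long documents, int blocksPerDocument, int blockBits) {
        return blocksPerDocument * candidatesPerKey(documents, blocksPerDocument, blockBits);
    }

    /** Time for the comparisons, given the cost of 10 million Hamming distance computations. */
    static double estimatedSeconds(long comparisons, double secondsPerTenMillion) {
        return comparisons / 1e7 * secondsPerTenMillion;
    }

    public static void main(String[] args) {
        long documents = 1L << 37;
        long perKey = candidatesPerKey(documents, 4, 16);
        long perQuery = candidatesPerQuery(documents, 4, 16);
        System.out.println("candidates per key   = " + perKey);
        System.out.println("candidates per query = " + perQuery);
        System.out.println("estimated seconds    = " + estimatedSeconds(perQuery, 0.3));
    }
}
```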

4、 HBase storage design

After this long detour, we have finally covered the theory we need. To keep the explanation as clear and easy to follow as possible, much of the theoretical material in this article draws on the blogs of several expert bloggers. Links to the original posts are attached at the end of the article. If anything is unclear, please go and pay your respects at their blogs, ha ha!

Now we focus on the design and implementation of HBase storage table.

From the above, we know that if the 64-bit fingerprint is split into four parts and the Hamming distance threshold is 3, then two similar fingerprints must share one identical 16-bit segment. Each 16-bit segment therefore maps to a set of 64-bit fingerprints, and every 64-bit fingerprint in that set contains that 16-bit segment. This can be expressed simply as follows (using 8-digit non-binary fingerprints as an example):

key value(set)
12 [12345678,12345679]
23 [12345678,12345679,23456789]

For an HBase-based implementation, we compare three possible designs.

Scheme 1:

The 16-bit fingerprint is the rowkey of the HBase table, every potentially similar 64-bit fingerprint is a column, and the column value stores the article ID; in other words, we build one very wide table. The layout is shown below (using 8-digit non-binary fingerprints as an example):

rowkey column1 column2 column3

The actual data table might look like this:

rowkey 12345678 32234567 23456789 12456789
12 1102101 1102102
23 1102104 1102105
34 1102106

With this design, the number of rowkeys in the HBase table is fixed: 16 binary digits can form 2 ^ 16 combinations, so there are 2 ^ 16 = 65536 rows. The number of possible columns is fixed as well: 2 ^ 64 = 18446744073709551616, about 1.8 * 10 ^ 19 columns.
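As a quick check of the two bounds, here is a small sketch (the class and method names are hypothetical). 2 ^ 64 does not fit into a Java long, so BigInteger is used for the column count.

```java
import java.math.BigInteger;

/** Upper bounds on rows and columns for the wide-table design. */
final class WideTableBounds {

    /** Number of distinct 16-bit rowkeys. */
    static long possibleRows() {
        return 1L << 16;
    }

    /** Number of distinct 64-bit fingerprints, i.e. possible column names. */
    static BigInteger possibleColumns() {
        return BigInteger.ONE.shiftLeft(64);
    }

    public static void main(String[] args) {
        System.out.println("rows    = " + possibleRows());
        System.out.println("columns = " + possibleColumns());
    }
}
```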

Suppose we now compare 56431234 against every text in the library. We only need to fetch the four rows whose rowkeys are in (56, 43, 12, 34) and traverse the columns of each row. Because HBase does not store null values, we only traverse the column names that actually hold a value.

According to the calculation above, if the 137.5 billion records are evenly distributed, each row holds about 8.39 million columns, and our data volume may well grow far beyond the 100-billion level. On top of that, using the 64-bit string as the column name takes up considerable storage space. Although HBase claims to support tens of millions of rows and columns, this design is still unreasonable: real data will never be ideally and evenly distributed, and a total of about 1.8 * 10 ^ 19 possible columns is also worrying.

Scheme 2:

The 16-bit fingerprint combined with the 64-bit fingerprint is the rowkey of the HBase table. The table has a single column whose value stores the article ID; in other words, we build one very long table. The layout is shown below (using 8-digit non-binary fingerprints as an example):

rowkey id

The actual data table might look like this:

rowkey id
12_12345678 1
34_12345678 1
56_12345678 1
78_12345678 1
34_22345678 2
23_12235678 3

This design feels better than Scheme 1: each article is saved as four rows. But it also has many drawbacks. First, the rowkey is too long. Even if we shorten it through some transformation, other problems remain. When reading, a get request has to be turned into four concurrent scans, each bounded by a start key and an end key, to fetch the data. If we want the scans to be sequential, hotspotting may appear. And in storage, the design also causes a lot of data redundancy.
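To see why Scheme 2 forces range scans, here is a sketch that uses a sorted TreeMap as a stand-in for an HBase table ordered by rowkey. This is not HBase client code: the class and method names are hypothetical, and the data is the 8-digit example from the table above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Sorted map standing in for the long table of Scheme 2: rowkey -> article ID. */
final class LongTableSketch {

    private final NavigableMap<String, String> rows = new TreeMap<>();

    /** One row per (block, fingerprint) pair, so each article is written four times. */
    void put(String block, String fingerprint, String articleId) {
        rows.put(block + "_" + fingerprint, articleId);
    }

    /** Range scan over every rowkey that starts with "<block>_". */
    List<String> scanBlock(String block) {
        String startKey = block + "_";
        // '`' is the character right after '_' in ASCII, so this is an exclusive stop key.
        String stopKey = block + "`";
        return new ArrayList<>(rows.subMap(startKey, true, stopKey, false).values());
    }

    public static void main(String[] args) {
        LongTableSketch table = new LongTableSketch();
        for (String block : new String[] {"12", "34", "56", "78"}) {
            table.put(block, "12345678", "1");
        }
        table.put("34", "22345678", "2");
        table.put("23", "12235678", "3");
        // A lookup for fingerprint 12345678 needs four scans like this one, one per block.
        System.out.println(table.scanBlock("34"));
    }
}
```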

Scheme 3:

This is the scheme we adopted in our real production environment, to avoid the problems and shortcomings of the two schemes above. Here is a brief introduction (if you have a better solution, please leave a message; thanks in advance!).

In short, I maintain the sets myself on the HBase side (through a coprocessor) and store each one as a JSON string. The format is as follows:


Because the company has text data of several topic types that are isolated from one another, deduplication and similarity calculation are also carried out per topic.

rowkey = HashNumber_ContentType_16simhash (24 characters in total)

  • HashNumber: to prevent hotspots, the table is hash pre-partitioned (64 pre-partitions); takes 2 characters

The formula is: String.format("%02x", Math.abs(key.hashCode()) % 64)

  • ContentType: the content topic type, 4 characters
  • 16simhash: a 16-bit truncated block of the simhash fingerprint, made up of 0s and 1s
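The rowkey layout above can be sketched as follows. The article does not say exactly which string is hashed for the prefix, so hashing "ContentType_16simhash" is my assumption, and the class and method names are hypothetical.

```java
/** Builds rowkeys of the form HashNumber_ContentType_16simhash (2 + 1 + 4 + 1 + 16 = 24 characters). */
final class RowKeys {

    /** Rowkey for one 16-character block of 0s and 1s under a 4-character content type. */
    static String rowKey(String contentType, String block16) {
        // Assumption: the salted key is the content type plus the block.
        String key = contentType + "_" + block16;
        String hashNumber = String.format("%02x", Math.abs(key.hashCode()) % 64);
        return hashNumber + "_" + key;
    }

    /** The four rowkeys that a 64-bit fingerprint is stored under (and looked up by). */
    static String[] rowKeysFor(String contentType, long fingerprint) {
        String bits = String.format("%64s", Long.toBinaryString(fingerprint)).replace(' ', '0');
        String[] keys = new String[4];
        for (int i = 0; i < 4; i++) {
            keys[i] = rowKey(contentType, bits.substring(16 * i, 16 * i + 16));
        }
        return keys;
    }

    public static void main(String[] args) {
        for (String key : rowKeysFor("news", 0x123456789ABCDEF0L)) {
            System.out.println(key);
        }
    }
}
```

Because the prefix is derived from the rest of the key, a reader can always recompute it, so every lookup stays a point get while the rows are spread over the 64 pre-partitions.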

The structure of the table is as follows:

rowkey                        si     s0  s1           s2           s3
01_news_010101010101010101    value  1   JSON string
02_news_010101010101010110    value  2   JSON string  JSON string
03_news_100101010101010110    value  3   JSON string  JSON string  JSON string
01_xbbs_010101010101010101    value  1   JSON string

si: the value the client passes in to be stored, made up of the 64-bit simhash and the ID joined by a double underscore, in the form simhash__ID.
s0: records how many sets the row holds; each set stores up to 10000 K-V pairs (about 1 MB).
s1: the first set, stored as a JSON string. Once it holds 10000 pairs, subsequent data is stored in s2.
s2: and so on.
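The column layout above can be modeled in memory to make the bookkeeping clear. This is a sketch with hypothetical names, not the production code: each map plays the role of one JSON set (s1, s2, ...), the list size plays the role of s0, and I assume the simhash is the key and the article ID the value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory model of one row: a counter (s0) plus a list of capacity-limited sets (s1, s2, ...). */
final class BucketedRow {

    private final int capacity;
    private final List<Map<String, String>> sets = new ArrayList<>();

    BucketedRow(int capacity) {
        this.capacity = capacity;
    }

    /** Stores simhash -> itemId unless the simhash is already present in any set of the row. */
    boolean add(String simhash, String itemId) {
        for (Map<String, String> set : sets) {
            if (set.containsKey(simhash)) {
                return false;
            }
        }
        // Open a new set when the row is empty or the last set is full.
        if (sets.isEmpty() || sets.get(sets.size() - 1).size() >= capacity) {
            sets.add(new LinkedHashMap<>());
        }
        sets.get(sets.size() - 1).put(simhash, itemId);
        return true;
    }

    /** Value of the s0 column: how many sets the row currently holds. */
    int s0() {
        return sets.size();
    }

    /** Number of pairs in set s<n>, with n starting at 1. */
    int sizeOfSet(int n) {
        return sets.get(n - 1).size();
    }

    public static void main(String[] args) {
        BucketedRow row = new BucketedRow(2);
        System.out.println(row.add("0101", "1102101"));
        System.out.println(row.add("0101", "1102199"));
        System.out.println(row.add("0110", "1102102"));
        System.out.println(row.add("0111", "1102103"));
        System.out.println("s0 = " + row.s0());
    }
}
```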

Of course, the most important part is that the JSON strings in s1 / s2 / s3 must be deduplicated. The simplest way is to read all the sets back to the client before saving, compare the new data against everything in the sets, and then write it back. That causes a lot of round-trip IO and hurts write performance. We therefore use HBase coprocessors to avoid the problem, so that all deduplication is done on the server side. The rough code is as follows:

package com.learn.share.scenarios.observers;

import com.google.gson.Gson;
import com.google.gson.JsonObject;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.List;

/**
 * Construction of a 10-billion-level text deduplication system based on a coprocessor.
 */
public class HBaseSimHashSetBuildSystem extends BaseRegionObserver {

    private Logger logger = LoggerFactory.getLogger(HBaseSimHashSetBuildSystem.class);

    @Override
    public void start(CoprocessorEnvironment e) throws IOException {
        logger.info("Coprocessor operation start...");
    }

    /**
     * Runs on the region server before every Put and maintains the simhash sets of the row.
     *
     * @param e          observer context, used to read the current row from the region
     * @param put        the incoming Put; the set columns are added to it
     * @param edit       the WAL edit (unused)
     * @param durability the durability setting (unused)
     * @throws IOException if reading the existing row fails
     */
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> e, Put put, WALEdit edit, Durability durability) throws IOException {
        // test flag
        logger.info("do something before Put operation...");

        // The client sends "simhash__itemid" in column f:si
        List<Cell> cells = put.get(Bytes.toBytes("f"), Bytes.toBytes("si"));
        if (cells == null || cells.size() == 0) {
            return;
        }
        String simhash__itemid = Bytes.toString(CellUtil.cloneValue(cells.get(0)));
        if (StringUtils.isEmpty(simhash__itemid) || simhash__itemid.split("__").length != 2) {
            return;
        }
        String simhash = simhash__itemid.trim().split("__")[0];
        String itemid = simhash__itemid.trim().split("__")[1];

        // Get the rowkey of the Put
        byte[] row = put.getRow();
        // Construct a Get from the rowkey and read the current row
        Get get = new Get(row);
        Result result = e.getEnvironment().getRegion().get(get);
        Cell columnCell = result.getColumnLatestCell(Bytes.toBytes("f"), Bytes.toBytes("s0")); // set size
        if (columnCell == null) {
            // Store data for the first time, initialize size to 1
            logger.info("store data for the first time, initialize size to 1");

            JsonObject jsonObject = new JsonObject();
            // Assumed set layout: key = 64-bit simhash, value = article ID
            jsonObject.addProperty(simhash, itemid);
            Gson gson = new Gson();
            String json = gson.toJson(jsonObject);

            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("s1"), Bytes.toBytes(json)); // first JSON set
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("s0"), Bytes.toBytes("1"));  // initialize the set count
        } else {
            byte[] sizebyte = CellUtil.cloneValue(columnCell);
            int size = Integer.parseInt(Bytes.toString(sizebyte));
            logger.info("not the first time to store data ----> rowkey `" + Bytes.toString(row)
                    + "` simhash set size is : " + size + ", the current value is : " + simhash__itemid);
            for (int i = 1; i <= size; i++) {
                Cell cell1 = result.getColumnLatestCell(Bytes.toBytes("f"), Bytes.toBytes("s" + i));
                String jsonBefore = Bytes.toString(CellUtil.cloneValue(cell1));
                Gson gson = new Gson();
                JsonObject jsonObject = gson.fromJson(jsonBefore, JsonObject.class);
                int sizeBefore = jsonObject.entrySet().size();
                if (jsonObject.has(simhash)) {
                    // Already stored in one of the sets: nothing to add
                    return;
                }
                if (i < size) {
                    // Not the last set yet: keep checking the remaining sets
                    continue;
                }
                if (sizeBefore == 10000) {
                    // The last set is full: open a new set and increase the count in s0
                    JsonObject jsonone = new JsonObject();
                    jsonone.addProperty(simhash, itemid);
                    String jsonstrone = gson.toJson(jsonone);
                    put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("s" + (size + 1)), Bytes.toBytes(jsonstrone)); // new JSON set
                    put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("s0"), Bytes.toBytes((size + 1) + "")); // updated set count
                } else {
                    // The last set still has room: append to it
                    jsonObject.addProperty(simhash, itemid);
                    String jsonAfter = gson.toJson(jsonObject);
                    put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("s" + size), Bytes.toBytes(jsonAfter)); // updated JSON set
                }
            }
        }
    }
}

This way, when we need to compare a text fingerprint against the data in the database, a single Table.get(List<Get>) call returns all the data, and we then read the data of each set in turn based on s0.

Now let's do the math. Assume one topic type again holds 2 ^ 37 records (about 137.5 billion), evenly distributed. A 16-bit block has 2 ^ 16 possible values, so the inverted index returns at most (2 ^ 37) * 4 / (2 ^ 16) = 8388608 candidates per row. That is about 839 sets per row at about 1 MB per set, so the amount of data stored is not too large.

With data for ten different topics, the number of HBase rows is (2 ^ 16) * 10 = 655360.
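The two figures above follow from a ceiling division and a multiplication, sketched here with hypothetical method names. The set capacity is a parameter, so other capacities can be tried as well.

```java
/** Row-level storage estimate for Scheme 3 under an even-distribution assumption. */
final class StorageEstimate {

    /** Sets needed per row: entries per row divided by set capacity, rounded up. */
    static long setsPerRow(long entriesPerRow, long entriesPerSet) {
        return (entriesPerRow + entriesPerSet - 1) / entriesPerSet;
    }

    /** Total rows: one row per 16-bit block value and topic. */
    static long totalRows(int blockBits, int topics) {
        return (1L << blockBits) * topics;
    }

    public static void main(String[] args) {
        System.out.println("sets per row = " + setsPerRow(8388608L, 10000L));
        System.out.println("total rows   = " + totalRows(16, 10));
    }
}
```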

What about Snappy compression?
What if we add FAST_DIFF encoding?
What if we also turn on MOB storage? Could each set then hold 100000 key-value pairs? Fewer than 90 sets per row would be needed.

Or, if the data volume is small, would Redis be a better choice?

In short, there is still plenty to optimize and improve, and plenty that is imperfect. I will stop the brief description here. If you have good suggestions or a different opinion, please leave a message! Thank you. Good night~~




Please credit the source when reprinting! You are welcome to follow my WeChat official account [HBase working notes].