Mysql database, how to deal with duplicate data?

Time:2021-1-27

preface

In fact, there are two ways to solve the problem of whether duplicate data can be inserted, just like water control, the first is from the source, and the second is on the way of water flow. We continue to look down with these two ideas:

problem

In our MySQL database, there are often some duplicate data. In some cases, we allow the existence of duplicate data, but sometimes we also need to delete these duplicate data. How do we deal with it?

Method 1: prevent duplicate data

That is to say, when we redesign the table, we should set a unique index for these data, so that its uniqueness can be guaranteed when inserting, and there will be no duplicate data. Of course, you can also set it to primary key directly. The effect is the same.
Let’s look at a case: there is no index or primary key in the following table, so multiple duplicate records are allowed in this table.

CREATE TABLE student
(
    first_name CHAR(20),
    last_name CHAR(20),
    sex CHAR(10)
);

Now first_ name,last_ Name can be repeated. If you don’t want to repeat it, there are two solutions:

1. Set double primary key mode

CREATE TABLE student
(
   first_name CHAR(20) NOT NULL,
   last_name CHAR(20) NOT NULL,
   sex CHAR(10),
   PRIMARY KEY (last_name, first_name)
);

You can’t insert duplicate data now.

2. Add unique index

CREATE TABLE student
(
   first_name CHAR(20) NOT NULL,
   last_name CHAR(20) NOT NULL,
   sex CHAR(10)
   UNIQUE (last_name, first_name)
);

These two forms seem to have a little difference, but they can play the same role. At this time, we can insert two duplicate data and find the error.
Of course, we can also verify it in the database:

SELECT COUNT(*) as repetitions, last_name, first_name
       FROM student
       GROUP BY last_name, first_name
       HAVING repetitions > 1;

Here we are counting first_ Name and last_ The number of duplicate records of name has been set in two ways. Here it must be 0.

Method 2: specify whether duplicate data can be inserted during insertion

Here we use the insert ignore into and insert into instructions.

(1) Insert ignore into will ignore the existing data in the database. If there is no data in the database, new data will be inserted. If there is data, this data will be skipped. In this way, the existing data in the database can be retained to achieve the purpose of inserting data in the gap.

(2) Insert into, on the contrary, inserts data directly, regardless of whether the database contains duplicate data.

Let’s give an example:

insert ignore into student (last_ name, first_ Name) values ('zhang San ','li Si');
//Results
Query OK, 1 rows affected (0.00 sec)
insert ignore into student (last_ name, first_ Name) values ('zhang San ','li Si');
//Results
Query OK, 0 rows affected (0.00 sec)

Now we can see it. That is to say, when the first insert operation is executed, if the database does not have one, a new record will be inserted directly, so one row of records will be affected. But when the second insert operation is executed, the database already has one, so it will not be inserted, and row 0 will be affected.

Of course, there is another instruction that can also perform a function similar to insert ignore into, which is replace into. He said that if there are records with the same primary or unique, they should be deleted first. Insert a new record.

Method 3: filter duplicate data

If you need to read non duplicate data, you can use the distinct keyword in the select statement to filter duplicate data.

SELECT DISTINCT last_name, first_name
FROM student
ORDER BY last_name;
You can also use group by to read non duplicate data in the data table
SELECT last_name, first_name
FROM student
GROUP BY (last_name, first_name);

Method 4: delete duplicate data

This situation is actually equivalent to solving at the end of the water. Look at the following SQL statement:

//Create a temporary table according to the student, and use group by to filter the duplicate data

CREATE TABLE tmp SELECT last_name, first_name, sex
        FROM student;
        GROUP BY (last_name, first_name);
//Delete the original student table
DROP TABLE student;
//Rename the temporary table
ALTER TABLE tmp RENAME TO stu;

Of course, you can also add index and primary key to the data table to delete duplicate records. The method is as follows

ALTER IGNORE TABLE student
ADD PRIMARY KEY (last_name, first_name);

OK, there are so many solutions. I believe it can solve your problem.