MySQL Data Table Merge and Reduplicate

Time:2019-8-12

Scene:

The crawled data is generated into a data table, which has the same structure as another main table and needs to be merged and de-duplicated.

Solution: (direct example)

  • First create two tables pep, pep2, where PEP is the main table

    CREATE TABLE IF NOT EXISTS `pep/pep2`(
    `id` INT UNSIGNED AUTO_INCREMENT,
    `no` VARCHAR(100) NOT NULL,
    PRIMARY KEY ( `id` )
    )ENGINE=InnoDB DEFAULT CHARSET=utf8;
  • Then insert two pieces of data into pep, and one piece of data in PEP 2 is the same as that in pep.

    insert into pep(no) values('abc');
    insert into pep(no) values('caa');
    
    insert into pep2(no) values('abc');
  • Insert pep2 data into PEP

    insert into pep (no) select no from pep2;
  • Grouping to recreate the new temporary table tmp

    create table tmp select id,no from pep group by no;

    Note: The ID field type of the table created is no longer primary key self-increasing

    Maybe you will make a mistake.
     ```Syntax error or access violation: 1055 Expression #1 of SELECT 
     list is not in GROUP BY clause and contains nonaggregated 
     column 'XXX.Y.ZZZZ' which is not functionally dependent on 
     columns in GROUP BY clause; this is incompatible with
      sql_mode=only_full_group_by
     ```
     Solution: Execute the following two orders:
     ```
     mysql> set global sql_mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
     
     mysql> set session sql_mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
     ```
  • Delete the PEP table and rename the TMP table PEP

    drop table pep;
    alter table tmp rename to pep;
  • Look at the desc structure and select * from pep to find that the field type of ID has changed, which needs to be changed back to the original type.

    alter table pep add primary key (id);
    alter table pep modify id int auto_increment;

There is also the use of join to do heavy, faster can also add a field (can be a few fields + the MD5 value), to create a unique index for this field, after inserting data, automatically filter back duplicate data.