An example of how to quickly insert 10 million level big data into MySQL

Time:2021-1-21

In the field of data analysis, database is our good helper. We can not only accept our query time, but also make further analysis on this basis. Therefore, we must insert data into the database. In practical applications, we often encounter tens of millions or even more data. If there is not a fast insertion method, it will get twice the result with half the effort and spend a lot of time.

In Alibaba’s Tianchi big data algorithm competition (pop music trend prediction), I encountered such a problem. Before optimizing database query and insertion, I spent a lot of unjust time. Before optimization, it took me more than 12 hours to insert 15 million pieces of data (using the most basic one by one insertion). This also prompted me to think about how to optimize the database insertion and query operation to improve efficiency.

In the process of continuous optimization, the performance has been greatly improved. In the process of downloading, playing and collecting more than 26000 songs from the database by time series query, the operation speed of query generation is improved, from the estimated more than 40 hours to more than one hour. In the aspect of database insertion, the performance has been greatly improved; in the test on the new data set, 54.9 million + data has been inserted in 20 minutes. Let’s share my experience.

The optimization process is divided into two steps. In the first step, the static reader reads data from the CSV file. When a certain amount of data is reached, it starts to insert the database program by multi thread. In the second step, it uses mysq to batch insert.

The first step is to read the file and start inserting multithreads

Here, reaching a certain amount is a problem that needs to be considered. In my experiment, I started to use 100W as the amount, but there was a new problem, Java heap memory overflow, and finally adopted 10W as the amount standard.

Of course, there can be other quantities, depending on which one you like.

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
 
import preprocess.ImportDataBase;
 
public class MuiltThreadImportDB {
 
 /**
  *Java multithreading read large files and put them into storage
  * 
  * @param args
  */
 private static int m_record = 99999;
 private static BufferedReader br = null;
 private ArrayList<String> list;
 private static int m_thread = 0;
 static {
 try {
  br = new BufferedReader(
  new FileReader(
  "E:/tianci/IJCAI15 Data/data_format1/user_log_format1.csv"),8192);
 
 } catch (FileNotFoundException e) {
  e.printStackTrace();
 }
 try {
  br.readLine (); // remove the CSV header
 } catch (IOException e) {
  e.printStackTrace();
 }
 }
 
 public void start() {
 String line;
 int count = 0;
 list = new ArrayList<String>(m_record + 1);
 synchronized (br) {
  try {
 while ((line = br.readLine()) != null) {
  if (count < m_record) {
 list.add(line);
 count++;
  } else {
 list.add(line);
 count = 0;
 Thread t1 = new Thread(new MultiThread(list),Integer.toString(m_thread++));
 t1.start();
 list = new ArrayList<String>(m_record + 1);
  }
 }
 
 if (list != null) {
  Thread t1 = new Thread(new MultiThread(list),Integer.toString(m_thread++));
  t1.start();
 }
  } catch (IOException e) {
 e.printStackTrace();
  }
 }
 }
 
 public static void main(String[] args) {
 new MuiltThreadImportDB().start();
 } 
}

The second step is to use multithreading to batch insert data


class MultiThread implements Runnable {
 private ArrayList<String> list;
 
 public MultiThread(ArrayList<String> list) {
 this.list = list;
 }
 
 public void run() {
 try {
  ImportDataBase insert = new ImportDataBase(list);
  insert.start();
 } catch (FileNotFoundException e) {
  e.printStackTrace();
 }
 display(this.list);
 }
 
 public void display(List<String> list) {
 // for (String str : list) {
 // System.out.println(str);
 // }
 System.out.print(Thread.currentThread().getName() + " :");
 System.out.println(list.size());
 }
 
}

In batch operation, the preparestatement class of MySQL is used. Of course, the batch operation of statement class is also used, and the performance is not as good as the former. The former can achieve the insertion speed of 1W + per second, while the latter only has 2000 +;


public int insertUserBehaviour(ArrayList<String> sqls) throws SQLException {
 
 String sql = "insert into user_behaviour_log (user_id,item_id,cat_id,merchant_id,brand_id,time_stamp,action_type)"
 + " values(?,?,?,?,?,?,?)";
 preStmt = conn.prepareStatement(sql);
 for (int i = 0; i < sqls.size(); i++) {
  UserLog log =new UserLog(sqls.get(i));
  preStmt.setString(1, log.getUser_id());
  preStmt.setString(2, log.getItem_id());
  preStmt.setString(3, log.getCat_id());
  preStmt.setString(4, log.getMerchant_id());
  preStmt.setString(5, log.getBrand_id());
  preStmt.setString(6, log.getTimeStamp());
  preStmt.setString(7, log.getActionType());
  preStmt.addBatch();
  if ((i + 1) % 10000 == 0) {
 preStmt.executeBatch();
 conn.commit();
 preStmt.clearBatch();
  }
 }
 preStmt.executeBatch();
 conn.commit();
 return 1;
 }

Of course, we have also experimented with different MySQL storage engines, InnoDB and myism. The experimental results show that InnoDB is faster (about 3 times), which may be related to the new version of mysq. The author’s version of MySQL is 5.6.

Finally, summarize the methods to improve the insertion speed under the large amount of data.

Java code, the use of multi-threaded insert, and the use of batch submission.

In terms of database, do not use the index when creating the table structure, otherwise the index B + tree will be maintained after the insertion process; when modifying the storage engine, InnoDB is generally used by default (the default is OK for the new version, but it may be required for the old version).

The above is the whole content of this article, I hope to help you learn, and I hope you can support developer more.

Recommended Today

The eve of Domain Driven Design: object oriented thinking

object-oriented Object oriented technology is a kind of technology to solve the problemworldUnderstanding and abstract methods. thatobjectWhat is it?Object is the understanding and abstraction of the world, and the world is also called objectcreation. Understanding the world is more complicated, but the world is made up ofthingIt’s made up of.It is such a relationship that […]