Solid foundations series: front-end large file upload

Date: 2022-06-06

One day, while browsing Juejin, I came across the article Front end large file upload. I had studied the underlying principles before, but I had never implemented it end to end myself, so it always felt a bit hollow. Recently I spent some time putting together an example to share with you.

Article code: github

(demo gif: upload)

Problem

Knowing the time available to provide a response can avoid problems with timeouts. Current implementations select times between 30 and 120 seconds

https://tools.ietf.org/id/draft-thomson-hybi-http-timeout-00.html

If a file is too large, such as audio/video data or exported Excel sheets, and the server does not respond within that 30–120 s window, the request may be treated as timed out and the upload is interrupted.

Another problem: while uploading a large file, the transfer may be interrupted or time out because of server or network issues. Since none of the data uploaded so far is saved, everything already transferred is wasted.

Principle

Large file upload works by splitting the big file into several small chunks and uploading them separately. After all chunks are uploaded, the server is notified to merge them, and the large file upload is complete.

This way of uploading solves several problems (a minimal sketch of the whole flow follows the list):

  • Request timeouts caused by files that are too large
  • One request is split into many (popular browsers allow about 6 concurrent same-origin requests by default), which raises concurrency and improves transfer speed
  • Chunk data is easy for the server to persist; after a network interruption, the chunks that were already uploaded do not need to be uploaded again next time
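
To make the flow concrete, here is a minimal sketch of the whole process; the helper names are illustrative and roughly correspond to the functions implemented in the rest of the article:

// Illustrative sketch of the overall flow; helper names are placeholders
// that roughly match the functions built later in this article.
async function uploadLargeFile(file) {
  // 1. slice the file and compute its fingerprint
  const { chunks, checksum } = await checkSum(file);

  // 2. upload the chunks concurrently
  await Promise.all(
    chunks.map((chunk, chunkId) => uploadChunk({ chunk, chunkId, checksum }))
  );

  // 3. ask the server to merge the chunks back into the original file
  await notifyMerge({ checksum, chunks: chunks.length, filename: file.name });
}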

Implementation

File slicing

The File interface is based on Blob, so we can call the slice method on the uploaded file object. A possible implementation:

export const slice = (file, piece = CHUNK_SIZE) => {
  return new Promise((resolve, reject) => {
    const totalSize = file.size;
    const chunks = [];
    // slice is vendor-prefixed in some older browsers
    const blobSlice = File.prototype.slice || File.prototype.mozSlice || File.prototype.webkitSlice;
    let start = 0;

    while (start < totalSize) {
      // the last chunk may be smaller than `piece`
      const end = start + piece >= totalSize ? totalSize : start + piece;
      const chunk = blobSlice.call(file, start, end);
      chunks.push(chunk);

      start = end;
    }

    resolve(chunks);
  });
};

Then each chunk is uploaded as form data:

_chunkUploadTask(chunks) {
  const tasks = [];

  for (const chunk of chunks) {
    const fd = new FormData();
    fd.append('chunk', chunk);

    // collect one upload promise per chunk instead of returning on the first one
    const task = axios({
      url: '/upload',
      method: 'post',
      data: fd,
    })
      .then((res) => res.data)
      .catch((err) => console.error(err));

    tasks.push(task);
  }

  return tasks;
}

The back end uses express, and file reception is handled by the [multer](https://github.com/expressjs/multer) library.

multer provides several upload methods: single, array, fields, none, and any. For a single file, both single and array work. Usage is simple: the uploaded file information is available via req.file or req.files.
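
As a quick, hedged illustration (not the article's actual route, which uses .any() with diskStorage below), accepting a single field named chunk could look like this:

// Hedged example: a route that accepts one file field named "chunk".
const express = require('express');
const multer = require('multer');

const upload = multer({ dest: 'uploads/tmp' });
const router = express.Router();

router.post('/upload', upload.single('chunk'), (req, res) => {
  // req.file holds the uploaded chunk's metadata
  res.json({ code: 200, filename: req.file.filename });
});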

In addition, diskStorage is used to customize the stored file name of each uploaded chunk and to ensure that every chunk name is unique.

const storage = multer.diskStorage({
  destination: uploadTmp,
  filename: (req, file, cb) => {
    //Specify the returned file name. If it is not specified, it will be generated randomly by default
    cb(null, file.fieldname);
  },
});
const multerUpload = multer({ storage });

// router
router.post('/upload', multerUpload.any(), uploadService.uploadChunk);

// service
uploadChunk: async (req, res) => {
  const file = req.files[0];
  const chunkName = file.filename;

  try {
    const checksum = req.body.checksum;
    const chunkId = req.body.chunkId;
    // the chunk record (checksum, chunkId, chunkName) is saved to the database here
    // so that already-uploaded chunks can be skipped on resume (omitted in this snippet)

    const message = Messages.success(modules.UPLOAD, actions.UPLOAD, chunkName);
    logger.info(message);
    res.json({ code: 200, message });
  } catch (err) {
    const errMessage = Messages.fail(modules.UPLOAD, actions.UPLOAD, err);
    logger.error(errMessage);
    res.status(500).json({ code: 500, message: errMessage });
  }
}

The uploaded chunks are saved under uploads/tmp; multer does this for us automatically. On success, req.files gives the file information, including the chunk's name and path, which makes the subsequent database handling easier.
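
For reference, with diskStorage each entry of req.files has roughly the following shape (values here are illustrative):

// Illustrative shape of req.files[0] when multer's diskStorage is used:
// {
//   fieldname: '0.<filemd5>.chunk',   // the form field name appended on the client
//   originalname: 'blob',
//   encoding: '7bit',
//   mimetype: 'application/octet-stream',
//   destination: 'uploads/tmp',
//   filename: '0.<filemd5>.chunk',    // set by our diskStorage filename()
//   path: 'uploads/tmp/0.<filemd5>.chunk',
//   size: 1048576
// }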

Why do we need to ensure that each chunk's file name is unique?

  • Because if the file name were random, then after a network interruption an unfinished upload would leave no matching record in the database, so the chunk could not be found on the next upload. The result is that the tmp directory fills up with orphaned chunks that can never be cleaned up.
  • At the same time, when an upload is paused, the corresponding temporary chunks can be deleted by name (this step is optional: when multer detects that a chunk already exists, it simply overwrites it)

There are two ways to make a chunk name unique:

  • Generate a fingerprint for each chunk while slicing the file (chunk md5)
  • Combine the fingerprint of the whole file with the chunk's index (filemd5 + chunkIndex)
//Modify the above code
const chunkName = `${chunkIndex}.${filemd5}.chunk`;
const fd = new FormData();
fd.append(chunkName, chunk);

So far, the slicing upload is roughly completed.

File merge

File merging reads each uploaded chunk and writes it into a new file. Since this consumes quite a lot of IO, it can be done in a separate thread.

// `path` is the merged target file; `chunks` is the total number of chunks
for (let chunkId = 0; chunkId < chunks; chunkId++) {
  const file = `${uploadTmp}/${chunkId}.${checksum}.chunk`;
  const content = await fsPromises.readFile(file);
  logger.info(Messages.success(modules.UPLOAD, actions.GET, file));
  try {
    // the target file already exists: append this chunk's content to it
    await fsPromises.access(path, fs.constants.F_OK);
    await appendFile({ path, content, file, checksum, chunkId });
    if (chunkId === chunks - 1) {
      res.json({ code: 200, message });
    }
  } catch (err) {
    // the target file does not exist yet: create it from this chunk's content
    await createFile({ path, content, file, checksum, chunkId });
  }
}
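
The appendFile and createFile helpers used in the loop are not shown in the snippet; a minimal sketch, assuming they write or append the chunk content and then remove the merged chunk file (deleting the chunk's database record by checksum and chunkId is omitted here):

// Minimal sketch of the helpers used in the merge loop above (assumptions:
// `path` is the merged target file, `file` is the chunk's temporary path).
const fs = require('fs');
const fsPromises = fs.promises;

const removeChunk = async ({ file }) => {
  // delete the temporary chunk file once its content has been merged
  await fsPromises.unlink(file);
};

const createFile = async ({ path, content, file }) => {
  await fsPromises.writeFile(path, content);
  await removeChunk({ file });
};

const appendFile = async ({ path, content, file }) => {
  await fsPromises.appendFile(path, content);
  await removeChunk({ file });
};

Back on the client, once all chunk-upload tasks settle, the /makefile request is sent, but only while the file is still in the uploading state: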

Promise.all(tasks).then(() => {
  // only send the /makefile request while the status is still UPLOADING;
  // in the CANCELED state the request would merge and delete chunks that were uploaded before the cancel
  if (this.status === fileStatus.UPLOADING) {
    const data = { chunks: this.chunks.length, filename, checksum: this.checksum };
    axios({
      url: '/makefile',
      method: 'post',
      data,
    })
      .then((res) => {
        if (res.data.code === 200) {
          this._setDoneProgress(this.checksum, fileStatus.DONE);
          toastr.success(`file ${filename} upload successfully!`);
        }
      })
      .catch((err) => {
        console.error(err);
        toastr.error(`file ${filename} upload failed!`);
      });
  }
});
  • First, access is used to check whether the merged target file already exists; if it does not, a new file is created from the chunk's content
  • If the target file already exists, the chunk's content is appended to it
  • After each chunk is read and merged successfully, the chunk file is deleted

Here are a few points to note:

  • If a file has only one chunk, the response must be sent right after createFile, otherwise the request stays in the pending state forever.

    await createFile({ path, content, file, checksum, chunkId });
    
    if (chunks.length === 1) {
      res.json({ code: 200, message });
    }
  • Before sending /makefile, check that the file is still in the uploading status; otherwise, in the canceled state the merge would still be triggered. The result: the chunk files are deleted after being merged, yet the database still holds their records, so the merged file ends up inconsistent.

Instant upload


How do you upload a file in an instant? Think about it for three seconds, and then the answer: 3.. 2.. 1….. It is all an illusion.

Why an illusion? Because nothing is transferred at all: the file already exists on the server. A few questions need to be clarified first:

  • How do we confirm that a file already exists on the server?
  • Should the uploaded file information be stored in the database or on the client?
  • What should be done if the file names are different but the contents are the same?

Question 1: how do we know the file already exists?

A fingerprint could be generated for every uploaded file, but if the file is very large, the time the client needs to compute it grows a lot. How do we solve this?

Remember the slice function from before? A big file is hard to handle in one go, so in the same spirit we cut it into chunks and compute the MD5 incrementally over them. The spark-md5 library is used here to generate the file hash; the slice method above is modified as follows:

export const checkSum = (file, piece = CHUNK_SIZE) => {
  return new Promise((resolve, reject) => {
    let totalSize = file.size;
    let start = 0;
    const blobSlice = File.prototype.slice || File.prototype.mozSlice || File.prototype.webkitSlice;
    const chunks = [];
    const spark = new SparkMD5.ArrayBuffer();
    const fileReader = new FileReader();

    const loadNext = () => {
      const end = start + piece >= totalSize ? totalSize : start + piece;
      const chunk = blobSlice.call(file, start, end);

      start = end;
      chunks.push(chunk);
      fileReader.readAsArrayBuffer(chunk);
    };

    fileReader.onload = (event) => {
      spark.append(event.target.result);

      if (start < totalSize) {
        loadNext();
      } else {
        const checksum = spark.end();
        resolve({ chunks, checksum });
      }
    };

    fileReader.onerror = () => {
      console.warn('oops, something went wrong.');
      reject();
    };

    loadNext();
  });
};
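
As a usage sketch (assuming the function above is exported as checkSum), slicing and fingerprinting now happen in a single pass:

// Hedged usage sketch of checkSum: one pass produces both the chunks and the fingerprint
const handleFile = async (file) => {
  const { chunks, checksum } = await checkSum(file);
  console.log(`split into ${chunks.length} chunks, fingerprint: ${checksum}`);
};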

Question 2: should the uploaded file information be stored in the database or on the client?

The uploaded file information should be stored in a database on the server side (on the client you could use IndexedDB). This has several advantages:

  • The database service provides a complete set of CRUD operations, which makes working with the data convenient
  • After the user refreshes the browser or switches to another browser, the upload records are not lost

The second point is the one worth emphasizing, because the first one could also be handled on the client.

const saveFileRecordToDB = async (params) => {
  const { filename, checksum, chunks, isCopy, res } = params;
  await uploadRepository.create({ name: filename, checksum, chunks, isCopy });

  const message = Messages.success(modules.UPLOAD, actions.UPLOAD, filename);
  logger.info(message);
  res.json({ code: 200, message });
};

Question 3: what should be done if the file names are different but the contents are the same?

There are also two solutions:

  • File copy: copy the file directly, then update the database record and add an isCopy flag
  • File reference: only add a database record, with isCopy and linkTo flags

What is the difference between the two methods:

With file copying, files can be deleted freely, because the original and its copies exist independently and deleting one does not affect the others. The downside is that many files with identical content accumulate.

With file references, deletion is more troublesome. Deleting a copy is fine, but if the original file is to be deleted, you must first copy the source file onto one of its copies, set that copy's isCopy to false, and only then delete the original file's database record.
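
To make that procedure concrete, here is a hedged sketch of deleting a file under the reference scheme; the repository methods and field names are assumptions, not the article's actual code:

// Illustrative sketch of deleting a file under the "file reference" scheme.
// uploadRepository.remove/update and the linkTo field are assumptions.
const deleteFileWithReferences = async (record) => {
  if (record.isCopy) {
    // a reference: just remove its database record
    await uploadRepository.remove(record.id);
    return;
  }

  // it is an original: find the records that link to it
  const copies = await uploadRepository.findAllBy({ linkTo: record.name });
  if (copies.length > 0) {
    // promote one copy to be the new original: give it a real file
    const newOriginal = copies[0];
    await fsPromises.copyFile(`${uploadPath}/${record.name}`, `${uploadPath}/${newOriginal.name}`);
    await uploadRepository.update(newOriginal.id, { isCopy: false, linkTo: null });
    // re-point the remaining copies to the new original
    for (const copy of copies.slice(1)) {
      await uploadRepository.update(copy.id, { linkTo: newOriginal.name });
    }
  }

  // finally remove the original file and its record
  await fsPromises.unlink(`${uploadPath}/${record.name}`);
  await uploadRepository.remove(record.id);
};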

Here is a picture to illustrate:

(figure: fileCopy)

In theory the file-reference approach is probably better, but I took the lazy route here and went with file copying.

//Client
uploadFileInSecond() {
  const id = ID();
  const filename = this.file.name;
  this._renderProgressBar(id);

  const names = this.serverFiles.map((file) => file.name);
  if (names.indexOf(filename) === -1) {
    const sourceFilename = names[0];
    const targetFilename = filename;

    this._setDoneProgress(id, fileStatus.DONE_IN_SECOND);
    axios({
      url: '/copyfile',
      method: 'get',
      params: { targetFilename, sourceFilename, checksum: this.checksum },
    })
      .then((res) => {
        if (res.data.code === 200) {
          toastr.success(`file ${filename} upload successfully!`);
        }
      })
      .catch((err) => {
        console.error(err);
        toastr.error(`file ${filename} upload failed!`);
      });
  } else {
    this._setDoneProgress(id, fileStatus.EXISTED);
    toastr.success(`file ${filename} has existed`);
  }
}

//Server side
copyFile: async (req, res) => {
  const sourceFilename = req.query.sourceFilename;
  const targetFilename = req.query.targetFilename;
  const checksum = req.query.checksum;
  const sourceFile = `${uploadPath}/${sourceFilename}`;
  const targetFile = `${uploadPath}/${targetFilename}`;

  try {
    await fsPromises.copyFile(sourceFile, targetFile);
    await saveFileRecordToDB({ filename: targetFilename, checksum, chunks: 0, isCopy: true, res });
  } catch (err) {
    const message = Messages.fail(modules.UPLOAD, actions.UPLOAD, err.message);
    logger.error(message);
    res.status(500).json({ code: 500, message });
  }
}

Pausing and resuming uploads

Pausing an upload really relies on xhr's abort method; since we use axios, which is built on top of XMLHttpRequest (ajax), we use the cancellation mechanism it encapsulates.

Here is the pause code:

const CancelToken = axios.CancelToken;

axios({
  url: '/upload',
  method: 'post',
  data: fd,
  cancelToken: new CancelToken((c) => {
    // An executor function receives a cancel function as a parameter
    canceler = c;
    this.cancelers.push(canceler);
  }),
})

Each axios request is given a cancelToken parameter; the CancelToken executor receives a cancel function, which we save as a handle so the request can be aborted later.

Then click Cancel to cancel the upload of each chunk, as follows:

// jQuery is used here to manipulate the DOM. Yes, really.

$(`#cancel${id}`).on('click', (event) => {
  const $this = $(event.target);
  $this.addClass('hidden');
  $this.next('.resume').removeClass('hidden');

  this.status = fileStatus.CANCELED;
  if (this.cancelers.length > 0) {
    for (const canceler of this.cancelers) {
      canceler();
    }
  }
});

Before uploading each chunk, we also need to check whether that chunk already exists on the server. Why?

If the network drops unexpectedly, the records of the chunks that were already uploaded are kept in the database. On the next upload, those chunks can be skipped instead of being transferred again, which saves time.

So the question is: should each chunk be checked individually, or should the list of existing chunks be fetched from the server in advance?

You can think about this one for three seconds as well; after all, it took quite a while to debug.

3.. 2.. 1……

It depends on how your code is structured, since everyone writes code differently. The principle is: you cannot block inside the loop, because each iteration needs to generate its cancelToken. If every chunk had to fetch data from the server inside the loop, the cancelTokens of later chunks would not be created yet, and those chunks would keep uploading after you click Cancel.

//Client
const chunksExisted = await this._isChunksExists();

for (let chunkId = 0; chunkId < this.chunks.length; chunkId++) {
  const chunk = this.chunks[chunkId];
  // the code used to look like this; awaiting a per-chunk request here
  // blocks the loop, so cancelTokens for later chunks are never created
  // const chunkExists = await isChunkExisted(this.checksum, chunkId);

  const chunkExists = chunksExisted[chunkId];

  if (!chunkExists) {
    const task = this._chunkUploadTask({ chunk, chunkId });
    tasks.push(task);
  } else {
    // if the chunk already exists, set the width of its progress bar to 100%
    this._setUploadingChunkProgress(this.checksum, chunkId, 100);
    this.progresses[chunkId] = chunk.size;
  }
}

//Server side
chunksExist: async (req, res) => {
  const checksum = req.query.checksum;
  try {
    const chunks = await chunkRepository.findAllBy({ checksum });
    const exists = chunks.reduce((cur, chunk) => {
      cur[chunk.chunkId] = true;
      return cur;
    }, {});
    const message = Messages.success(modules.UPLOAD, actions.CHECK, `chunk ${JSON.stringify(exists)} exists`);
    logger.info(message);
    res.json({ code: 200, message: message, data: exists });
  } catch (err) {
    const errMessage = Messages.fail(modules.UPLOAD, actions.CHECK, err);
    logger.error(errMessage);
    res.status(500).json({ code: 500, message: errMessage });
  }
}
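
The client-side _isChunksExists helper used earlier is not shown in the article; a minimal sketch, assuming the handler above is mounted at GET /chunks (the real route name may differ):

// Hedged sketch; the '/chunks' route is an assumption, the real path lives in the project's router
async _isChunksExists() {
  const res = await axios({
    url: '/chunks',
    method: 'get',
    params: { checksum: this.checksum },
  });
  // the server returns a map such as { 0: true, 3: true }, built in chunksExist above
  return (res.data && res.data.data) || {};
}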

Resuming simply means triggering the upload again; there is not much to say about it, the main thing is solving the problem above.

$(`#resume${id}`).on('click', async (event) => {
  const $this = $(event.target);
  $this.addClass('hidden');
  $this.prev('.cancel').removeClass('hidden');

  this.status = fileStatus.UPLOADING;
  await this.uploadFile();
});

Progress feedback

Progress feedback relies on XMLHttpRequest.upload; axios wraps it with the onUploadProgress option. Two progress indicators need to be displayed here:

  • Progress per chunk
  • Total progress of all chunks

Each chunk's progress is calculated from the progress event's loaded and total values; nothing special here.

axios({
  url: '/upload',
  method: 'post',
  data: fd,
  onUploadProgress: (progressEvent) => {
    const loaded = progressEvent.loaded;
    const chunkPercent = ((loaded / progressEvent.total) * 100).toFixed(0);

    this._setUploadingChunkProgress(this.checksum, chunkId, chunkPercent);
  },
})

The total progress is obtained by accumulating the loaded bytes of every chunk and dividing by file.size.

constructor(checksum, chunks, file) {
  // one progress slot (loaded bytes) per chunk
  this.progresses = Array(chunks.length).fill(0);
}

axios({
  url: '/upload',
  method: 'post',
  data: fd,
  onUploadProgress: (progressEvent) => {
    const chunkProgress = this.progresses[chunkId];
    const loaded = progressEvent.loaded;
    this.progresses[chunkId] = loaded >= chunkProgress ? loaded : chunkProgress;
    const percent = ((this._getCurrentLoaded(this.progresses) / this.file.size) * 100).toFixed(0);

    this._setUploadingProgress(this.checksum, percent);
  },
})

_setUploadingProgress(id, percent) {
  // ...

  // for some reason, progressEvent.loaded can report more bytes than the file size
  const isUploadChunkDone = Number(percent) >= 100;
  // hold back 1% until the merge (/makefile) step is done
  const ratio = isUploadChunkDone ? 99 : percent;
}
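
The _getCurrentLoaded helper is not shown either; a minimal version (an assumption, not the article's exact code) just sums the per-chunk loaded bytes:

// Hedged sketch: total bytes uploaded so far across all chunks
_getCurrentLoaded(progresses) {
  return progresses.reduce((acc, loaded) => acc + loaded, 0);
}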

One thing to note: the loaded >= chunkProgress ? loaded : chunkProgress check exists because, when resuming, some chunks may restart uploading from 0. Without it, the total progress bar would jump backwards.

Database configuration

The database layer uses sequelize + mysql; the initialization code is as follows:

const initialize = async () => {
  // create db if it doesn't already exist
  const { DATABASE, USER, PASSWORD, HOST } = config;
  const connection = await mysql.createConnection({ host: HOST, user: USER, password: PASSWORD });
  try {
    await connection.query(`CREATE DATABASE IF NOT EXISTS ${DATABASE};`);
  } catch (err) {
    logger.error(Messages.fail(modules.DB, actions.CONNECT, `create database ${DATABASE}`));
    throw err;
  }

  // connect to db
  const sequelize = new Sequelize(DATABASE, USER, PASSWORD, {
    host: HOST,
    dialect: 'mysql',
    logging: (msg) => logger.info(Messages.info(modules.DB, actions.CONNECT, msg)),
  });

  // init models and add them to the exported db object
  db.Upload = require('./models/upload')(sequelize);
  db.Chunk = require('./models/chunk')(sequelize);

  // sync all models with database
  await sequelize.sync({ alter: true });
};
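
The model files referenced above are not included here; a minimal sketch of what models/chunk.js might contain, assuming the fields used elsewhere in the article (checksum, chunkId, name):

// Hedged sketch of models/chunk.js; the field list is an assumption based on
// the fields used elsewhere in the article.
const { DataTypes } = require('sequelize');

module.exports = (sequelize) => {
  return sequelize.define('Chunk', {
    checksum: { type: DataTypes.STRING, allowNull: false },
    chunkId: { type: DataTypes.INTEGER, allowNull: false },
    name: { type: DataTypes.STRING },
  });
};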

Deployment

The production deployment uses docker-compose; the configuration is as follows:

Dockerfile

FROM node:16-alpine3.11

# Create app directory
WORKDIR /usr/src/app

# A wildcard is used to ensure both package.json AND package-lock.json are copied
# where available ([email protected]+)
COPY package*.json ./

# If you are building your code for production
# RUN npm ci --only=production

# Bundle app source
COPY . .

# Install app dependencies
RUN npm install
RUN npm run build:prod

docker-compose.yml

version: "3.9"
services:
  web:
    build: .
    # sleep for 20 sec, wait for database server start
    command: sh -c "sleep 20 && npm start"
    ports:
      - "3000:3000"
    environment:
      NODE_ENV: prod
    depends_on:
      - db
  db:
    image: mysql:8
    command: --default-authentication-plugin=mysql_native_password
    restart: always
    ports:
      - "3306:3306"
    environment:
      MYSQL_ROOT_PASSWORD: pwd123

One thing to note: the web service must wait until the database service has started, otherwise it errors out on connect, hence the 20-second delay in the command.

Deploy to heroku

  1. create heroku.yml

    build:
      docker:
        web: Dockerfile
    run:
      web: npm run start:heroku
  2. modify package.json

    {
      "scripts": {
        "start:heroku": "NODE_ENV=heroku node ./bin/www"
      }
    }
  3. deploy to heroku

    # create heroku repos
    heroku create upload-demos
    heroku stack:set container 
    
    # when adding addons, remember to configure your billing card in heroku first [important]
    # add mysql addons
    heroku addons:create cleardb:ignite 
    # get mysql connection url
    heroku config | grep CLEARDB_DATABASE_URL
    # will echo => DATABASE_URL: mysql://xxxxxxx:xxxxxx@xx-xxxx-east-xx.cleardb.com/heroku_9ab10c66a98486e?reconnect=true
    
    # set mysql database url
    heroku config:set DATABASE_URL='mysql://xxxxxxx:xxxxxx@xx-xxxx-east-xx.cleardb.com/heroku_9ab10c66a98486e?reconnect=true'
    
    # add heroku.js to src/db/config folder
    # use the DATABASE_URL which you get from the previous step to configure the js file
    module.exports = {
      HOST: 'xx-xxxx-east-xx.cleardb.com',
      USER: 'xxxxxxx',
      PASSWORD: 'xxxxxx',
      DATABASE: 'heroku_9ab10c66a98486e',
    };
    
    # push source code to remote
    git push heroku master

Summary

So far, all the problems have been solved. My overall feeling is that there are far more details to handle than expected. Some things cannot just be read about; it takes time to actually build them, which is how we better understand the principles and stay motivated to learn new things.

You never really know until you do it yourself.

There are many more details in the github repository, including local dev server configuration, log storage, and so on; those interested can fork it and dig in. It took quite some effort to put together, so a ⭐️ would be appreciated.
