How does Tomcat handle file upload?

Time:2021-8-19


Preface

Over the past couple of days I saw an interesting question about Tomcat in a Q&A forum, one I had never thought about before. Today I will explain the "why" in terms of how Tomcat works.
The sections on the HTTP file upload standard and on Tomcat's mechanisms cover the basics. If you don't need them, you can jump straight to the end of the article.

File upload in HTTP protocol

As we all know, HTTP is a text protocol. So how does a text protocol transfer files?

It transmits them directly... Yes, it's that simple. "Text protocol" only describes the application layer. At the transport layer all data is just bytes, so file content needs no additional encoding or decoding.

multipart/form-data mode

The HTTP protocol specifies a form-based file upload. Set the form's enctype attribute to multipart/form-data, then add an <input> tag whose type is file.

 <FORM ENCTYPE="multipart/form-data" ACTION="_URL_" METHOD=POST>

   File to process: <INPUT NAME="userfile1" TYPE="file">

   <INPUT TYPE="submit" VALUE="Send File">

 </FORM>

A multipart/form-data form differs from the default application/x-www-form-urlencoded form (x-www-form-urlencoded for short). Both are forms and both can carry multiple fields, but the former can upload files while the latter can only transmit text.

Now let's look at the wire format of a form file upload. The following figure shows a simple multipart/form-data request message:
As the figure shows, the HTTP header part is unchanged except that a boundary parameter is added to Content-Type, while the payload part is completely different.

The boundary separates the fields of a multipart/form-data form. In the payload, a boundary line appears at the start and at the end, and between every two fields (part / item).

When reading the request, the server only needs to take the boundary from Content-Type and split the payload on it to obtain all the fields.

Each field starts with its own header section containing a Content-Disposition header, which records the field name (name). If the field is a file, it also carries a filename attribute, and the next line holds a Content-Type header identifying the file type.
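A minimal Java sketch of the server-side steps described above: take the boundary from Content-Type, split the payload on it, and separate each part's headers from its content. The class MultipartSketch and its methods are hypothetical names used for illustration, not Tomcat's parser. The sketch assumes a well-formed, text-only payload held in a String; a real parser works on the raw byte stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not how Tomcat's parser is implemented.
final class MultipartSketch {

    /** One part of the payload: its header lines and its content. */
    static final class Part {
        final String headers;
        final String content;

        Part(String headers, String content) {
            this.headers = headers;
            this.content = content;
        }

        /** Returns the field name from Content-Disposition, or null if absent. */
        String name() {
            // Require ';' or whitespace before "name" so that "filename" does not match.
            Matcher m = Pattern.compile("[;\\s]name=\"([^\"]*)\"").matcher(headers);
            return m.find() ? m.group(1) : null;
        }

        /** A part is treated as a file if it carries a filename attribute. */
        boolean isFile() {
            return headers.contains("filename=\"");
        }
    }

    /** Extracts the boundary parameter from a Content-Type header value. */
    static String boundaryOf(String contentType) {
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.regionMatches(true, 0, "boundary=", 0, 9)) {
                String b = p.substring(9).trim();
                // The boundary value may be quoted.
                if (b.length() >= 2 && b.startsWith("\"") && b.endsWith("\"")) {
                    b = b.substring(1, b.length() - 1);
                }
                return b;
            }
        }
        throw new IllegalArgumentException("no boundary in: " + contentType);
    }

    /** Splits a well-formed payload into parts using "--" + boundary as delimiter. */
    static List<Part> splitParts(String payload, String boundary) {
        List<Part> parts = new ArrayList<>();
        for (String chunk : payload.split(Pattern.quote("--" + boundary))) {
            // Skip the empty preamble and the "--" that closes the final delimiter.
            if (chunk.trim().isEmpty() || chunk.startsWith("--")) {
                continue;
            }
            // Each chunk looks like: CRLF headers CRLF CRLF content CRLF
            String body = chunk.substring(2, chunk.length() - 2);
            int sep = body.indexOf("\r\n\r\n");
            if (sep < 0) {
                throw new IllegalArgumentException("part without header/content separator");
            }
            parts.add(new Part(body.substring(0, sep), body.substring(sep + 4)));
        }
        return parts;
    }

    public static void main(String[] args) {
        String payload = "--XyZ\r\n"
                + "Content-Disposition: form-data; name=\"title\"\r\n\r\n"
                + "hello\r\n"
                + "--XyZ\r\n"
                + "Content-Disposition: form-data; name=\"userfile1\"; filename=\"a.txt\"\r\n"
                + "Content-Type: text/plain\r\n\r\n"
                + "file body\r\n"
                + "--XyZ--\r\n";
        String boundary = boundaryOf("multipart/form-data; boundary=XyZ");
        for (Part p : splitParts(payload, boundary)) {
            System.out.println(p.name() + " file=" + p.isFile() + " content=" + p.content);
        }
    }
}
```

The sample payload in main is made up for the demonstration. It contains one text field and one file field, which is the smallest message that shows both kinds of part.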

Although both x-www-form-urlencoded and multipart forms can carry fields, multipart can carry files as well as text fields. The multipart file upload is also a standard, so servers generally support it and can read the files directly.

x-www-form-urlencoded can only carry basic text data. Nobody can stop you from forcing a file through it as text, but then the back end has to parse it as a string. The byte -> string encoding overhead is completely unnecessary and may lead to encoding errors.

An x-www-form-urlencoded message has no boundary. The fields are joined with the & symbol, and each key / value is URL-encoded.
Although x-www-form-urlencoded adds an encoding step, it adds no per-field headers and no boundary, so for text fields the message is much smaller than the multipart equivalent.
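The x-www-form-urlencoded body format can be sketched in a few lines of Java: URL-encode every key and value, then join the pairs with &. FormUrlEncodedSketch is a hypothetical helper name, and the sketch assumes Java 10 or later for URLEncoder.encode(String, Charset).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration; HTTP clients normally do this for you.
final class FormUrlEncodedSketch {

    /** Builds an x-www-form-urlencoded body: key=value pairs joined by '&'. */
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            // Keys and values are encoded separately so '&' and '=' inside them stay data.
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }
}
```

For example, the fields title = "hello world" and note = "a&b=c" (in that order) produce title=hello+world&note=a%26b%3Dc. There is no per-field header and no boundary, only the encoded pairs.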

Besides multipart, there is another way to upload a file directly, but it is less commonly used.

Binary payload mode

Besides multipart/form-data, there is the binary payload upload method. "Binary payload" is my own name for it, because I could not find a description of this method in the HTTP specification (if you know of one, please post a link in the comments), yet many HTTP clients support it.

For example, Postman:
For example, OkHttp:

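import okhttp3.MediaType;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;

OkHttpClient client = new OkHttpClient().newBuilder()
  .build();
MediaType mediaType = MediaType.parse("image/png");
// The whole request body is the file content: no form fields, no boundary
RequestBody body = RequestBody.create(mediaType, "<file contents here>");
Request request = new Request.Builder()
  .url("http://localhost:8098/upload") // OkHttp requires an http or https scheme
  .method("POST", body)
  .addHeader("Content-Type", "image/png")
  .build();
Response response = client.newCall(request).execute();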

This method is very simple: the whole payload is used to store the file data. As the following figure shows, the entire payload is the file content:

Although this method is simple and easy for clients to implement, servers do not support it well. Tomcat, for example, does not treat this binary payload as a file, but as an ordinary request body.

Analysis of Tomcat processing mechanism

When Tomcat processes a plain text message, it first reads the header part and parses Content-Length to determine where the message ends. It does not read the remaining payload all at once. Instead it wraps it in an InputStream that internally calls socket read to fetch data from RCV_BUF (this applies when the full message is larger than the read buffer size).

When a read operation that involves the payload is called on HttpServletRequest, such as getParameter or getInputStream, the InputStream reads the socket's RCV_BUF to fetch the payload data.

The characteristic of this approach, wrapping an InputStream that reads RCV_BUF internally instead of reading all the data at once and holding it in memory, is that Tomcat stores no data and only provides a wrapper. Read operations that the application layer performs on ServletRequest#getInputStream are forwarded to a read of the socket's RCV_BUF.

However, if the application layer reads ServletRequest#getInputStream to the end, converts the result to a string, and keeps it in memory, that has nothing to do with Tomcat.
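The "wrapper that stores nothing" idea can be illustrated with a small Java sketch. LazyBodyStream is a hypothetical class, not Tomcat's implementation: it holds no data of its own, forwards every read to the underlying stream (standing in for the socket), and stops after Content-Length bytes so it never reads past the end of the message.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration of a lazy, length-bounded request body stream.
final class LazyBodyStream extends InputStream {
    private final InputStream socketIn; // stands in for the socket's input stream
    private long remaining;             // payload bytes not yet consumed

    LazyBodyStream(InputStream socketIn, long contentLength) {
        this.socketIn = socketIn;
        this.remaining = contentLength;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // end of this message, even if the socket has more data
        }
        int b = socketIn.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            return -1;
        }
        // Never ask the socket for more than what is left of this message.
        int n = socketIn.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }

    /** Number of payload bytes the application has not read yet. */
    long remaining() {
        return remaining;
    }
}
```

Constructing the wrapper reads nothing. Data only leaves the underlying stream when the application calls read, which mirrors the behavior described above.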

Tomcat has a special processing mechanism for multipart requests. Since multipart is designed for transferring files, Tomcat introduces the concept of a temporary file for this type of request: when parsing the message, it writes the data of the multipart parts to disk.

As shown in the figure below, Tomcat wraps each field in a DiskFileItem (org.apache.tomcat.util.http.fileupload.disk.DiskFileItem), which does not distinguish between file and text data. A DiskFileItem has a header part and a content part. Content up to sizeThreshold is kept in memory and anything beyond it is stored on disk. However, this value defaults to 0, so by default all content is stored on disk.
Since the data is stored on disk, it must also be read back from disk... which is naturally less efficient. So if you only transmit text, do not use the multipart type, because the data will take a detour through the disk.
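The role of sizeThreshold can be sketched as an output stream that keeps data in memory until the threshold is exceeded and then moves everything to a temporary file. SpillingBuffer is a hypothetical class written for this article, not Tomcat's DiskFileItem code, and the temp file prefix is an arbitrary choice.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of a size-threshold buffer; not Tomcat's code.
final class SpillingBuffer extends OutputStream {
    private final int sizeThreshold;
    private final ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private OutputStream disk; // non-null once the data has spilled to disk
    private Path tempFile;
    private long written;

    SpillingBuffer(int sizeThreshold) {
        this.sizeThreshold = sizeThreshold;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        if (disk == null && written + len > sizeThreshold) {
            // Threshold exceeded: move what is in memory to a temp file.
            tempFile = Files.createTempFile("upload_", ".tmp");
            disk = Files.newOutputStream(tempFile);
            memory.writeTo(disk);
            memory.reset();
        }
        (disk != null ? disk : memory).write(buf, off, len);
        written += len;
    }

    @Override
    public void close() throws IOException {
        if (disk != null) {
            disk.close();
        }
    }

    boolean isInMemory() {
        return disk == null;
    }

    /** Path of the temp file, or null while the data is still in memory. */
    Path tempFile() {
        return tempFile;
    }
}
```

With a threshold of 0, the very first byte written already exceeds the threshold, so everything goes to disk, which matches the default behavior described above. With a larger threshold, small text fields would stay in memory.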

Another little-known fact: when Tomcat processes a multipart message, if a field is not a file, it adds that field's key / value to the parameter map, so non-file fields can be obtained through request.getParameter / getParameterMap.

//org.apache.catalina.connector.Request#parseParts

if (part.getSubmittedFileName() == null) {
    String name = part.getName();
    String value = null;
    try {
        value = part.getString(charset.name());
    } catch (UnsupportedEncodingException uee) {
        // Not possible
    }
    ......
        parameters.addParameter(name, value);
}

Keep in mind that getParameter normally returns only form parameters and query-string parameters. But multipart is a form too, so there seems to be nothing wrong with getting its parameters this way.

A simple summary

Tomcat handles different types of requests as follows:

  1. If the parameters are passed in the query string of a GET request (appended to the URL), they are all in the message header and are read into memory at once
  2. For a POST request, Tomcat only reads the header part. It does not actively read the payload, but wraps the socket in an InputStream for the application layer to read

    1. x-www-form-urlencoded messages are not read actively, but many web frameworks (such as Spring MVC) call getParameter, which triggers the InputStream to read from RCV_BUF
    2. The same goes for the binary payload described above. Tomcat does not initiate the read. The application layer has to read ServletRequest#getInputStream to fetch the data from RCV_BUF
    3. multipart messages are not read actively either. Parsing / reading is triggered only by a call to HttpServletRequest#getParts. Likewise, many web frameworks call getParts, so parsing gets triggered

Why write a temporary file first instead of wrapping the InputStream and handing it directly to the application layer?

If the application layer does not read RCV_BUF (in time), incoming data eventually fills RCV_BUF. From then on the receiver cannot accept more data (it advertises a zero receive window), so the client's data piles up in its SND_BUF and cannot be sent. Once the client application has filled SND_BUF as well, the connection is blocked.

The reasons below are my personal opinion and are not backed by official documentation. If you see it differently, please leave a comment to discuss.

Multipart is generally used to transfer files, and files are usually much larger than the socket buffer. So, in order not to block the TCP connection, Tomcat reads the complete payload in one go and stores all of its parts on disk (headers in memory, content on disk).

The application layer then only needs to read the part data from the DiskFileItem that Tomcat provides. Although this adds an extra transfer step, the data in RCV_BUF gets consumed in time.

In terms of efficiency, transferring the data and saving it to disk is certainly much slower than not doing so, but RCV_BUF is consumed in time, which keeps the TCP connection from blocking.

With HTTP/2 multiplexing, multiple requests share one TCP connection, so if RCV_BUF is not consumed in time, all the "logical HTTP connections" on it are blocked.

Then why don't other types of messages need to be staged on disk?

Because they are small. Ordinary request messages are not very large, typically a few KB to a few dozen KB. Moreover, a plain text message is read promptly and all at once, unlike a multipart message, which combines text and files and may contain several files.

For example, after receiving a file the server may also need to forward it to a cloud vendor's object storage service. There are two ways to do the transfer:

  1. Receive the complete file data, hold it in memory, and then call the object storage SDK
  2. In streaming mode, read from ServletRequest#getInputStream and write to the SDK's OutputStream

With method 1, RCV_BUF is read in time, but the memory footprint is too large and can easily exhaust memory, which is unreasonable.
With method 2, memory consumption is very small (at most one read buffer), but since both the read side and the write side are network I/O, RCV_BUF may not be consumed in time.
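Method 2 boils down to a copy loop with a single fixed-size buffer. The Java sketch below uses plain InputStream and OutputStream: in stands in for ServletRequest#getInputStream and out for the upload stream of an object storage SDK. StreamCopySketch and the 8 KB buffer size are assumptions for illustration, not part of any SDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical illustration of the streaming transfer (method 2).
final class StreamCopySketch {

    /** Copies in to out using one fixed-size buffer and returns the byte count. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192]; // the only memory held, whatever the file size
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            // If this write blocks on the outbound network, the next read is delayed,
            // and unread data accumulates in the inbound socket buffer meanwhile.
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }
}
```

The loop shows both sides of the trade-off: memory use is bounded by the buffer, but the rate at which the inbound side is drained depends on how fast the outbound side accepts data.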

Moreover, Jetty handles multipart the same way as Tomcat. I haven't looked at other web servers, but I expect they handle it the same way too.

