How PHP reads large files

Time: 2022-06-15
Contents
  • Measure success
  • What options do we have?
  • Read file line by line
  • Pipes between files
  • Other streams
  • Filters
  • Customizing streams
  • Create custom protocols and filters
  • Summary

Measure success

The only way to confirm that an improvement to our code is effective is to measure the bad situation first, then compare that measurement against one taken after applying the fix. In other words, unless we know how much (if at all) a “solution” helps us, we can't know whether it really is a solution.

We can focus on two metrics. The first is CPU usage: how fast or slow is the process we're working on? The second is memory usage: how much memory does the script take to run? The two are often inversely proportional, meaning we can reduce memory usage at the cost of CPU usage, and vice versa.

In an asynchronous execution model, such as a multi-process or multi-threaded PHP application, both CPU and memory usage are important considerations. In traditional PHP architectures, these generally only become a problem when either one hits the limits of the server.

It's impractical to measure CPU usage from inside PHP. If that's the area you want to focus on, consider using something like the top command on Ubuntu or macOS. On Windows, consider using the Linux subsystem, so you can run top inside Ubuntu.

In this tutorial we will measure memory usage. We'll look at how much memory “traditional” scripts use, then implement a couple of optimization strategies and measure those too. By the end, I hope you'll be able to make an educated choice.

Here is the helper we'll use to track memory usage:

// memory.php
// The formatBytes function is based on the php.net documentation
function formatBytes($bytes, $precision = 2) {
    $units = array("b", "kb", "mb", "gb", "tb");
    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);
    $bytes /= (1 << (10 * $pow));
    return round($bytes, $precision) . " " . $units[$pow];
}

print formatBytes(memory_get_peak_usage());

We will require this at the end of each script, so we can see how much memory each one uses at its peak.

What options do we have?

There are many ways to read files efficiently, and there are two broad scenarios in which to use them. We may want to read all of the data and process it at once: outputting the processed data, or doing something else with it. We may also want to transform a stream of data without ever really needing access to its contents.

To picture the first scenario, imagine wanting to read a file and hand off every 10,000 lines to a separate queue for processing. We'd need to load at least 10,000 lines into memory before passing them to the queue manager (whichever one we use).
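As a rough sketch of that first scenario, we could batch lines with a generator and hand each batch off. The readInBatches name and the queue call below are hypothetical stand-ins, not code from this tutorial:

// a minimal sketch: batch lines into groups of 10,000
function readInBatches($path, $batchSize = 10000) {
    $handle = fopen($path, "r");
    $batch = [];
    while (!feof($handle)) {
        $batch[] = trim(fgets($handle));
        if (count($batch) === $batchSize) {
            yield $batch;
            $batch = [];
        }
    }
    fclose($handle);
    if (count($batch)) {
        yield $batch; // whatever lines remain
    }
}

foreach (readInBatches("big-file.txt") as $batch) {
    // $queue->push($batch); // whichever queue manager is used
}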

For the second scenario, imagine we want to compress the contents of a particularly large API response. We don't care what the contents are, but we need to make sure they're backed up in a compressed form.

In both scenarios we need to read large files. The difference is that in the first we need to know what the data is, while in the second we don't care. Let's explore both approaches in depth.

Read file line by line

PHP has many functions for working with files. Let's combine a few of them into a simple file reader:


// from memory.php
function formatBytes($bytes, $precision = 2) {
    $units = array("b", "kb", "mb", "gb", "tb");
    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);
    $bytes /= (1 << (10 * $pow));
    return round($bytes, $precision) . " " . $units[$pow];
}
print formatBytes(memory_get_peak_usage());
// from reading-files-line-by-line-1.php
function readTheFile($path) {
    $lines = [];
    $handle = fopen($path, "r");
    while(!feof($handle)) {
        $lines[] = trim(fgets($handle));
    }
    fclose($handle);
    return $lines;
}
readTheFile("shakespeare.txt");
require "memory.php";

Here we're reading a text file containing the complete works of Shakespeare. The file is about 5.5 MB in size, and peak memory usage is 12.8 MB. Now let's use a generator to read each line instead:


// from reading-files-line-by-line-2.php
function readTheFile($path) {
    $handle = fopen($path, "r");
    while(!feof($handle)) {
        yield trim(fgets($handle));
    }
    fclose($handle);
}
readTheFile("shakespeare.txt");
require "memory.php";

The file size is the same, but peak memory usage drops to 393 KB. That number doesn't mean much on its own, though; it matters once we actually process the file data. For example, let's split the document into chunks whenever we see two blank lines:


// from reading-files-line-by-line-3.php
// readTheFile() is the generator from the previous example
$iterator = readTheFile("shakespeare.txt");
$buffer = "";
foreach ($iterator as $iteration) {
    preg_match("/\n{3}/", $buffer, $matches);
    if (count($matches)) {
        print ".";
        $buffer = "";
    } else {
        $buffer .= $iteration . PHP_EOL;
    }
}
require "memory.php";

Care to guess how much memory we'll use this time? Even though we split the text document into 126 chunks, we use only 459 KB of memory. Given the nature of generators, the most memory we'll use at any point is the amount needed to store the largest text chunk during iteration. In this case, the largest chunk is 101,985 characters.

Generators have other uses, but reading large files efficiently is clearly one of them. If we need to work on the data as we read, generators are probably the best approach.
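As an aside, PHP's built-in SplFileObject gives us similar lazy, line-by-line iteration without writing a generator ourselves. A small sketch, using the same shakespeare.txt file:

$file = new SplFileObject("shakespeare.txt");
$file->setFlags(SplFileObject::DROP_NEW_LINE);

foreach ($file as $line) {
    // only the current line is held in memory at a time
}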

Pipes between files

There are also situations where we don't need to process the data, just transfer it from one file to another. This is commonly called piping (presumably because we can't see inside a pipe except at its two ends, provided it's opaque, of course). We can achieve this with streams. First, let's write a script that transfers one file to another, so we can measure the memory usage:


// from piping-files-1.php
file_put_contents(
    "piping-files-1.txt", file_get_contents("shakespeare.txt")
);
require "memory.php";

The result is unsurprising: this script uses more memory than the text file it copies. That's because it has to read (and keep) the entire file in memory until it has been written to the other file. This is fine for small files, but not for large ones.

Let’s try streaming (or pipelining) from one file to another:


// from piping-files-2.php
$handle1 = fopen("shakespeare.txt", "r");
$handle2 = fopen("piping-files-2.txt", "w");
stream_copy_to_stream($handle1, $handle2);
fclose($handle1);
fclose($handle2);
require "memory.php";

This code is a bit strange. We open handles to both files, the first in read mode and the second in write mode. Then we copy from the first to the second, and finish by closing both files again. You may be surprised to learn that the memory used is 393 KB. That number looks familiar: it's the memory the generator used while reading line by line. That's because the second argument to fgets specifies how many bytes of each line to read (it defaults to -1, meaning until it reaches a new line). The third argument to stream_copy_to_stream is exactly the same kind of parameter (with exactly the same default). stream_copy_to_stream reads from one stream, one line at a time, and writes it to the other stream. Since we never need to do anything with the value, it skips the step where the generator yields it.
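For intuition, here is a rough hand-rolled equivalent that copies a bounded chunk at a time, so only one chunk is ever held in memory. This is a sketch of the general idea, not what stream_copy_to_stream literally does internally:

$handle1 = fopen("shakespeare.txt", "r");
$handle2 = fopen("piping-files-2-manual.txt", "w");

while (!feof($handle1)) {
    fwrite($handle2, fread($handle1, 8192)); // copy 8 KB at a time
}

fclose($handle1);
fclose($handle2);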

Transferring text between files isn't all that interesting on its own, so let's consider another example. Suppose we want to output an image from our CDN. We could describe it with code like this:


// from piping-files-3.php
file_put_contents(
    "piping-files-3.jpeg", file_get_contents(
        "https://github.com/assertchris/uploads/raw/master/rick.jpg"
    )
);
// ...or write this straight to stdout, if we don't need the memory info
require "memory.php";

Imagine an application route leading to this step. But this time, instead of fetching the image from the local file system, we get it from a CDN. We use file_get_contents instead of something more elegant (like Guzzle), but the net effect is the same.

The memory usage comes to 581 KB. Now, how about streaming it instead?


// from piping-files-4.php
$handle1 = fopen(
    "https://github.com/assertchris/uploads/raw/master/rick.jpg", "r"
);
$handle2 = fopen(
    "piping-files-4.jpeg", "w"
);
// ...or write this straight to stdout, if we don't need the memory info
stream_copy_to_stream($handle1, $handle2);
fclose($handle1);
fclose($handle2);
require "memory.php";

The memory usage is slightly less (400 KB), but the result is the same. If we didn't need the memory statistics, we could just as well print straight to standard output. In fact, PHP provides a simple way to do that:


$handle1 = fopen(
    "https://github.com/assertchris/uploads/raw/master/rick.jpg", "r"
);
$handle2 = fopen(
    "php://stdout", "w"
);
stream_copy_to_stream($handle1, $handle2);
fclose($handle1);
fclose($handle2);
// require "memory.php";

Other streams

There are a few other streams we can pipe to and/or read from and write to:

  • php://stdin: read-only
  • php://stderr: write-only, similar to php://stdout
  • php://input: read-only, giving us access to the raw request body
  • php://output: write-only, letting us write to an output buffer
  • php://memory and php://temp: read-write, places to store data temporarily. The difference is that php://temp stores its data in the file system once it grows large enough, while php://memory keeps storing in memory until that runs out (see the short sketch below)
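For example, here is a small sketch using php://temp, which stays in memory until the data grows past a threshold (about 2 MB by default) and then spills over to a temporary file:

$handle = fopen("php://temp", "r+");
fwrite($handle, "hello world");

rewind($handle); // seek back to the start before reading
print stream_get_contents($handle); // hello world

fclose($handle);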

Filters

There's another trick we can use with streams, called filters. They're a kind of in-between step, providing a tiny bit of control over the data in a stream without exposing it to us. Suppose we wanted to compress our shakespeare.txt file. We might use the Zip extension:


// from filters-1.php
$zip = new ZipArchive();
$filename = "filters-1.zip";
$zip->open($filename, ZipArchive::CREATE);
$zip->addFromString("shakespeare.txt", file_get_contents("shakespeare.txt"));
$zip->close();
require "memory.php";

This is neat code, but it clocks in at around 10.75 MB of memory. We can do better with filters:


// from filters-2.php
$handle1 = fopen(
    "php://filter/zlib.deflate/resource=shakespeare.txt", "r"
);
$handle2 = fopen(
    "filters-2.deflated", "w"
);
stream_copy_to_stream($handle1, $handle2);
fclose($handle1);
fclose($handle2);
require "memory.php";

Here we can see the php://filter/zlib.deflate filter, which reads and compresses the contents of a resource. We can then pipe the compressed data into another file. This only uses 896 KB of memory.

Granted, the output format is different, and making a zip archive has other upsides. Still, you have to wonder: if you could choose a different format and save 12 times the memory, wouldn't you?

To decompress the data, simply use another zlib filter:


// from filters-2.php
file_get_contents(
    "php://filter/zlib.inflate/resource=filters-2.deflated"
);
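If the decompressed data is itself large, we can stream it to a file instead of buffering it with file_get_contents, using the same piping approach as before (the output filename here is just an example):

$handle1 = fopen(
    "php://filter/zlib.inflate/resource=filters-2.deflated", "r"
);
$handle2 = fopen(
    "filters-2-inflated.txt", "w"
);
stream_copy_to_stream($handle1, $handle2);
fclose($handle1);
fclose($handle2);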

Customizing streams

fopen and file_get_contents have their own default sets of options, but these are fully customizable. To define them, we need to create a new stream context:


// from creating-contexts-1.php
$data = join("&", [
    "twitter=assertchris",
]);
$headers = join("\r\n", [
    "Content-type: application/x-www-form-urlencoded",
    "Content-length: " . strlen($data),
]);
$options = [
    "http" => [
        "method" => "POST",
        "header"=> $headers,
        "content" => $data,
    ],
];
$context = stream_context_create($options);
$handle = fopen("https://example.com/register", "r", false, $context);
$response = stream_get_contents($handle);
fclose($handle);

In this example, we're trying to make a POST request to an API. The endpoint is served over HTTPS, but we still use the http context property (as it applies to both http and https). We set a few headers and open a file handle to the API. We can open the handle as read-only, since the context takes care of the writing.
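As a side note (this is standard PHP behavior, not something specific to this example): for http and https streams, the response headers are available through the stream's metadata, which we can inspect before closing the handle:

// run this before fclose($handle)
$meta = stream_get_meta_data($handle);

// for http(s) streams, wrapper_data holds the raw response headers
foreach ($meta["wrapper_data"] as $header) {
    print $header . PHP_EOL; // e.g. "HTTP/1.1 200 OK"
}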

Create custom protocols and filters

Before wrapping up, let's talk about creating custom protocols. If you look at the documentation, you'll find an example stream wrapper class to implement:


Protocol {
    public resource $context;
    public __construct ( void )
    public __destruct ( void )
    public bool dir_closedir ( void )
    public bool dir_opendir ( string $path , int $options )
    public string dir_readdir ( void )
    public bool dir_rewinddir ( void )
    public bool mkdir ( string $path , int $mode , int $options )
    public bool rename ( string $path_from , string $path_to )
    public bool rmdir ( string $path , int $options )
    public resource stream_cast ( int $cast_as )
    public void stream_close ( void )
    public bool stream_eof ( void )
    public bool stream_flush ( void )
    public bool stream_lock ( int $operation )
    public bool stream_metadata ( string $path , int $option , mixed $value )
    public bool stream_open ( string $path , string $mode , int $options ,
        string &$opened_path )
    public string stream_read ( int $count )
    public bool stream_seek ( int $offset , int $whence = SEEK_SET )
    public bool stream_set_option ( int $option , int $arg1 , int $arg2 )
    public array stream_stat ( void )
    public int stream_tell ( void )
    public bool stream_truncate ( int $new_size )
    public int stream_write ( string $data )
    public bool unlink ( string $path )
    public array url_stat ( string $path , int $flags )
}

We're not going to implement one here, since I think it deserves its own tutorial; there is a lot of work to do. But once that work is done, we can register our stream wrapper quite easily:


if (in_array("highlight-names", stream_get_wrappers())) {
    stream_wrapper_unregister("highlight-names");
}
stream_wrapper_register("highlight-names", "HighlightNamesProtocol");
$highlighted = file_get_contents("highlight-names://story.txt");
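To give a sense of the shape of that work, here is a minimal, read-only sketch of such a class. It is illustrative only: the actual highlighting is left as a placeholder, and a production wrapper would implement far more of the methods listed above:

class HighlightNamesProtocol {
    public $context;
    private $handle;

    public function stream_open($path, $mode, $options, &$opened_path) {
        // strip our scheme and open the underlying file
        $real = preg_replace("~^highlight-names://~", "", $path);
        $this->handle = fopen($real, $mode);
        return $this->handle !== false;
    }

    public function stream_read($count) {
        $chunk = fread($this->handle, $count);
        // ...transform $chunk here, e.g. wrap names in markup...
        return $chunk;
    }

    public function stream_eof() {
        return feof($this->handle);
    }

    public function stream_close() {
        fclose($this->handle);
    }

    public function stream_stat() {
        return fstat($this->handle);
    }
}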

Similarly, it's also possible to create custom stream filters. The documentation has an example filter class:


Filter {
    public $filtername;
    public $params;
    public int filter ( resource $in , resource $out , int &$consumed ,
        bool $closing )
    public void onClose ( void )
    public bool onCreate ( void )
}

It can be registered just as easily:


$handle = fopen("story.txt", "w+");
stream_filter_append($handle, "highlight-names", STREAM_FILTER_READ);

The highlight-names name needs to match the filtername property of the new filter class. It's also possible to use custom filters in a php://filter/highlight-names/resource=story.txt style string. Defining filters is much easier than defining protocols, partly because protocols need to handle directory operations, whereas filters only need to handle each chunk of data.
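For comparison, here is a minimal sketch of such a filter, built on PHP's php_user_filter base class. The str_replace stands in for real highlighting logic:

class HighlightNamesFilter extends php_user_filter {
    public function filter($in, $out, &$consumed, $closing): int {
        // process each bucket of data as it flows through the stream
        while ($bucket = stream_bucket_make_writeable($in)) {
            $bucket->data = str_replace("Romeo", "**Romeo**", $bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

// the name here must match what stream_filter_append asks for
stream_filter_register("highlight-names", "HighlightNamesFilter");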

If you have the inclination, I strongly recommend experimenting with custom protocols and filters. If you can apply filters to stream_copy_to_stream operations, your application will use next to no memory even when working with obscenely large files. Imagine writing a resize-image filter or an encrypt-for-application filter.


Summary

Although this isn't a problem we often suffer from, it's easy to get wrong when working with large files. In asynchronous applications, it's easy to bring the whole server down when we're not careful about memory usage.

Hopefully this tutorial has given you a few new ideas (or refreshed your memory of old ones), so that you can think more about how to read and write large files efficiently. Once we become familiar with streams and generators, and stop using functions like file_get_contents for everything, an entire category of errors disappears from our applications, which is a good thing.
