
How Do You Read Big Files with PHP?

by Arthor Alberto

PHP developers generally don't need to worry about memory management. The PHP engine does a good job of cleaning up after us, and the web server model of short-lived execution contexts means even careless code has no long-lasting effects. Still, there are rare times, such as reading large files on a small server, when memory really matters, and that is the situation we will look at in this article.

To improve our code, we have to measure the bad situation first and compare that measurement to another taken after we have applied our fix. Unless we know how much a solution helps us, we can't know whether it is really a solution at all. There are two things we could measure: CPU usage, which tells us how fast or slow our process runs, and memory usage, which tells us how much memory the script takes to execute. The two often trade off against each other: reducing one tends to increase the other.

In this article, we are going to measure memory usage. We will look at how much memory traditional scripts use, then implement some optimization strategies and measure those too. At the end, you can make an informed choice.

The methods we will use to see how much memory is used are memory_get_peak_usage and a small formatBytes helper, combined in a memory.php script:

// from memory.php

// formatBytes is adapted from the php.net documentation
function formatBytes($bytes, $precision = 2) {
    $units = array("b", "kb", "mb", "gb", "tb");

    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);

    $bytes /= (1 << (10 * $pow));

    return round($bytes, $precision) . " " . $units[$pow];
}

print formatBytes(memory_get_peak_usage());


We will require this script at the end of each example, so we can see which script uses the most memory at any one time.

There are many ways to read files efficiently, and there are two scenarios in which we might need to. In the first, we want to read and process the data at the same time, outputting the processed data or performing other actions based on what we read. In the second, we want to transform a stream of data without ever really needing access to the data itself.

For the first scenario, let's imagine we want to be able to read a file and create separate queued processing jobs every 10,000 lines. We would need to keep at least 10,000 lines in memory at a time and pass them along to the queued job manager.

For the second scenario, let's imagine we want to compress the contents of a particularly large API response. We don't need to know what the data says, but we do need to make sure it is backed up in a compressed form.

In both scenarios, we need to read large files. In the first, we need to know what the data is; in the second, we don't care what the data is. Let's explore both options.

Reading Files, Line by Line

There are many functions for working with files. We will combine a few into a naive file reader.


// from reading-files-line-by-line-1.php

function readTheFile($path) {
    $lines = [];
    $handle = fopen($path, "r");

    while (!feof($handle)) {
        $lines[] = trim(fgets($handle));
    }

    fclose($handle);
    return $lines;
}

readTheFile("phptutorial.txt");

require "memory.php";

Here we are reading a text file named "phptutorial.txt". The file is about 5.5 MB, and peak memory usage is 12.8 MB, since the whole file ends up in the $lines array. Now let's use a generator to read each line instead.

// from reading-files-line-by-line-2.php

function readTheFile($path) {
    $handle = fopen($path, "r");

    while (!feof($handle)) {
        yield trim(fgets($handle));
    }

    fclose($handle);
}

readTheFile("phptutorial.txt");

require "memory.php";


In the above code, the text file is the same size, but peak memory usage drops to 393 KB. Nothing actually happens until we do something with the data we are reading. We can even split the document into chunks whenever we see two blank lines, like this:

// from reading-files-line-by-line-3.php

$iterator = readTheFile("phptutorial.txt");

$buffer = "";

foreach ($iterator as $iteration) {
    preg_match("/\n{3}/", $buffer, $matches);

    if (count($matches)) {
        print ".";
        $buffer = "";
    } else {
        $buffer .= $iteration . PHP_EOL;
    }
}

require "memory.php";

Even though we split the text document into 1,216 chunks, we still use only 459 KB of memory. Given the way generators work, the most memory we will use is whatever is needed to store the largest text chunk in an iteration; here, the largest chunk is 101,985 characters. If we need to work on the data as we read it, generators are probably the best approach.
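To tie this back to the first scenario, here is a minimal sketch of how the generator above could feed batches of 10,000 lines to a queue. The dispatchToQueue function is hypothetical, standing in for whichever queued job manager you actually use:

// hedged sketch, reusing readTheFile from reading-files-line-by-line-2.php
$batch = [];

foreach (readTheFile("phptutorial.txt") as $line) {
    $batch[] = $line;

    if (count($batch) === 10000) {
        dispatchToQueue($batch); // hypothetical queue manager call
        $batch = [];
    }
}

if (count($batch)) {
    dispatchToQueue($batch); // flush the final partial batch
}

Only one batch is ever held in memory at a time, which is exactly the property we were after.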

Piping Between Files

When we don’t need to operate on the data, we can pass file data from one file to another by using piping. We can do this by using stream methods. Now we will show you the script to transfer from one file to another, so that we can measure the memory usage.


// from piping-files-1.php

file_put_contents(
    "piping-files-1.txt",
    file_get_contents("phptutorial.txt")
);

require "memory.php";


This script uses slightly more memory than the size of the text file it copies, because it has to read (and keep) the whole file in memory until it has written it to the new file. That's fine for small files, but not for large ones. Let's try streaming (or piping) from one file to the other instead:


// from piping-files-2.php

$handle1 = fopen("phptutorial.txt", "r");$handle2 = fopen("piping-files-2.txt", "w");

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);fclose($handle2);

require "memory.php";


In this code we open handles to both files, the first in read mode and the second in write mode, then we copy from the first into the second and close both. It may surprise you that the peak memory usage is just 393 KB. Copying one text file into another isn't all that useful by itself, though, so let's look at what other streams we can work with.

Some Streams


Here are some other streams we can pipe to, write to, and read from:

php://stdin (read-only)

php://stderr (write-only, like php://stdout)

php://input (read-only) which gives us access to the raw request body

php://output (write-only) which lets us write to an output buffer

php://memory and php://temp (read-write) are places we can store data temporarily. The main difference is that php://temp will move the data to the file system once it grows large enough, while php://memory keeps storing it in memory until that runs out.
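As a small illustration of piping between these streams, here is a sketch (not one of this article's measured scripts) that copies standard input straight to standard output without ever buffering the whole input in memory:

// sketch: pipe stdin straight to stdout
$stdin = fopen("php://stdin", "r");
$stdout = fopen("php://stdout", "w");

stream_copy_to_stream($stdin, $stdout);

fclose($stdin);
fclose($stdout);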


Filters

Another feature we can use with streams is filters. They are a kind of in-between step, giving us a little control over stream data without exposing it to us. Imagine we want to compress our phptutorial.txt file. We might use the Zip extension:


// from filters-1.php

$zip = new ZipArchive();
$filename = "filters-1.zip";

$zip->open($filename, ZipArchive::CREATE);
$zip->addFromString("phptutorial.txt", file_get_contents("phptutorial.txt"));
$zip->close();

require "memory.php";

This is clean code, but it clocks in at around 10.75 MB of peak memory. We can do better with filters:

// from filters-2.php

$handle1 = fopen(
    "php://filter/zlib.deflate/resource=phptutorial.txt", "r"
);

$handle2 = fopen(
    "filters-2.deflated", "w"
);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

Here we can see the php://filter/zlib.deflate filter, which reads and compresses the contents of a resource. We then pipe the compressed data into another file. The whole thing uses only 896 KB of peak memory.
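Getting the text back out works the same way in reverse, using the matching zlib.inflate filter. This is a sketch along the same lines, not one of the article's measured scripts:

// sketch: decompressing filters-2.deflated back into text
$handle = fopen(
    "php://filter/zlib.inflate/resource=filters-2.deflated", "r"
);

print stream_get_contents($handle);

fclose($handle);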

Customizing Streams

fopen and file_get_contents have their own sets of default options, but these are completely customizable. To use them, we need to create a new stream context:

// from creating-contexts-1.php

$data = join("&", [
    "twitter=phpnews",
]);

$headers = join("\r\n", [
    "Content-type: application/x-www-form-urlencoded",
    "Content-length: " . strlen($data),
]);

$options = [
    "http" => [
        "method" => "POST",
        "header" => $headers,
        "content" => $data,
    ],
];

$context = stream_context_create($options);

$handle = fopen("https://example.com/register", "r", false, $context);
$response = stream_get_contents($handle);

fclose($handle);


Here we are making a POST request to an API. The endpoint is secure (https), but we still use the http context property, because it applies to both http and https. We set a few headers and open a file handle to the API. We can open the handle as read-only, since the context takes care of the writing.
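Contexts can carry more than headers and a body. As a sketch, here is the same options array with a request timeout added; timeout is one of PHP's documented http context options, though the value here is just an example:

// sketch: the same POST options with a ten-second timeout
$options = [
    "http" => [
        "method" => "POST",
        "header" => $headers,
        "content" => $data,
        "timeout" => 10.0, // seconds to wait before giving up
    ],
];

$context = stream_context_create($options);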

Making Custom Protocols and Filters

Before we end this article, let's talk about making custom protocols. If you look at the documentation, you can find an example class to implement:


Protocol {
    public resource $context;
    public __construct ( void )
    public __destruct ( void )
    public bool dir_closedir ( void )
    public bool dir_opendir ( string $path , int $options )
    public string dir_readdir ( void )
    public bool dir_rewinddir ( void )
    public bool mkdir ( string $path , int $mode , int $options )
    // ... the prototype continues with rename, rmdir, url_stat,
    // and the stream_* methods for reading and writing
}
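That stub is for a custom protocol (a stream wrapper). Custom filters are easier to start with: we extend php_user_filter and register the class with stream_filter_register. Here is a minimal sketch in the pattern of the PHP documentation, not taken from this article's own scripts:

// sketch: a tiny custom filter that uppercases stream data
class UppercaseFilter extends php_user_filter {
    public function filter($in, $out, &$consumed, $closing): int {
        // take each bucket of data from the input brigade,
        // transform it, and pass it along to the output brigade
        while ($bucket = stream_bucket_make_writeable($in)) {
            $bucket->data = strtoupper($bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }

        return PSFS_PASS_ON;
    }
}

stream_filter_register("str.uppercase", "UppercaseFilter");

$handle = fopen(
    "php://filter/str.uppercase/resource=phptutorial.txt", "r"
);
print stream_get_contents($handle);
fclose($handle);

Once registered, a custom filter can be used anywhere the built-in ones can, including the php://filter paths we used earlier.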

