Thursday, November 16, 2017

How to Read Big Files with PHP (Without Killing Your Server)

It’s not often that we, as PHP developers, need to worry about memory management. The PHP engine does a stellar job of cleaning up after us, and the web server model of short-lived execution contexts means even the sloppiest code has no long-lasting effects.

There are rare times when we may need to step outside of this comfortable boundary --- like when we're trying to run Composer for a large project on the smallest VPS we can create, or when we need to read large files on an equally small server.

Fragmented terrain

It’s the latter problem we'll look at in this tutorial.

The code for this tutorial can be found on GitHub.

Measuring Success

The only way to be sure we’re making any improvement to our code is to measure a bad situation and then compare that measurement to another after we’ve applied our fix. In other words, unless we know how much a “solution” helps us (if at all), we can’t know if it really is a solution or not.

There are two metrics we care about. The first is CPU usage: how fast or slow is the process we want to run? The second is memory usage: how much memory does the script take to execute? These are often inversely proportional --- meaning that we can reduce memory usage at the cost of CPU usage, and vice versa.

In an asynchronous execution model (like with multi-process or multi-threaded PHP applications), both CPU and memory usage are important considerations. In traditional PHP architecture, these generally become a problem when either one reaches the limits of the server.

It's impractical to measure CPU usage from inside PHP. If that’s the area you want to focus on, consider using something like top on Ubuntu or macOS. On Windows, consider using the Windows Subsystem for Linux, so you can run top inside Ubuntu.

For the purposes of this tutorial, we’re going to measure memory usage. We’ll look at how much memory is used in “traditional” scripts. We’ll implement a couple of optimization strategies and measure those too. In the end, I want you to be able to make an educated choice.

The methods we’ll use to see how much memory is used are:

// memory_get_peak_usage() reports the peak amount of memory the script has used
memory_get_peak_usage();

// formatBytes is taken from the php.net documentation

function formatBytes($bytes, $precision = 2) {
    $units = array("b", "kb", "mb", "gb", "tb");

    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);

    $bytes /= (1 << (10 * $pow));

    return round($bytes, $precision) . " " . $units[$pow];
}

We’ll use these functions at the end of our scripts, so we can see which script uses the most memory at one time.

What Are Our Options?

There are many approaches we could take to read files efficiently. But there are also two likely scenarios in which we could use them. We could want to read and process data all at the same time, outputting the processed data or performing other actions based on what we read. We could also want to transform a stream of data without ever really needing access to the data.

Let’s imagine, for the first scenario, that we want to be able to read a file and create separate queued processing jobs every 10,000 lines. We’d need to keep at least 10,000 lines in memory, and pass them along to the queued job manager (whatever form that may take).
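Here’s a rough sketch of that first scenario. dispatchJob is a hypothetical stand-in for whatever queued job manager we actually use, and the batch size of 10,000 is just the number from the example above:

$handle = fopen("shakespeare.txt", "r");
$batch = [];

while (!feof($handle)) {
    $batch[] = trim(fgets($handle));

    if (count($batch) >= 10000) {
        dispatchJob($batch); // hypothetical call to our queued job manager
        $batch = [];
    }
}

if (count($batch)) {
    dispatchJob($batch); // don't forget the final, partial batch
}

fclose($handle);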

For the second scenario, let’s imagine we want to compress the contents of a particularly large API response. We don’t care what it says, but we need to make sure it’s backed up in a compressed form.
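One way we could sketch that second scenario is to stream the response through PHP’s built-in zlib filter as we write it to disk. The URL and file names here are made up for illustration:

// hypothetical API endpoint we want to back up
$input = fopen("https://example.com/large-response.json", "r");
$output = fopen("backup.deflated", "w");

// everything written to $output is now compressed on the way through
stream_filter_append($output, "zlib.deflate", STREAM_FILTER_WRITE, ["level" => 6]);

stream_copy_to_stream($input, $output);

fclose($input);
fclose($output);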

In both scenarios, we need to read large files. In the first, we need to know what the data is. In the second, we don’t care what the data is. Let’s explore these options…

Reading Files, Line By Line

There are many functions for working with files. Let’s combine a few into a naive file reader:

// from memory.php

function formatBytes($bytes, $precision = 2) {
    $units = array("b", "kb", "mb", "gb", "tb");

    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);

    $bytes /= (1 << (10 * $pow));

    return round($bytes, $precision) . " " . $units[$pow];
}

print formatBytes(memory_get_peak_usage());

// from reading-files-line-by-line-1.php

function readTheFile($path) {
    $lines = [];
    $handle = fopen($path, "r");

    while(!feof($handle)) {
        $lines[] = trim(fgets($handle));
    }

    fclose($handle);
    return $lines;
}

readTheFile("shakespeare.txt");

require "memory.php";

We’re reading a text file containing the complete works of Shakespeare. The text file is about 5.5MB, and the peak memory usage is 12.8MB. Now, let’s use a generator to read each line:

// from reading-files-line-by-line-2.php

function readTheFile($path) {
    $handle = fopen($path, "r");

    while(!feof($handle)) {
        yield trim(fgets($handle));
    }

    fclose($handle);
}

readTheFile("shakespeare.txt");

require "memory.php";

The text file is the same size, but the peak memory usage is 393KB. This doesn’t mean anything until we do something with the data we’re reading. Perhaps we can split the document into chunks whenever we see two blank lines. Something like this:

// from reading-files-line-by-line-3.php

$iterator = readTheFile("shakespeare.txt");

$buffer = "";

foreach ($iterator as $iteration) {
    preg_match("/\n{3}/", $buffer, $matches);

    if (count($matches)) {
        print ".";
        $buffer = "";
    } else {
        $buffer .= $iteration . PHP_EOL;
    }
}

require "memory.php";

Any guesses how much memory we’re using now? Would it surprise you to know that, even though we split the text document up into 1,216 chunks, we still only use 459KB of memory? Given the nature of generators, the most memory we’ll use is that which we need to store the largest text chunk in an iteration. In this case, the largest chunk is 101,985 characters.

I’ve already written about the performance boosts of using generators and Nikita Popov’s Iterator library, so go check that out if you’d like to see more!

Generators have other uses, but this one is demonstrably good for performant reading of large files. If we need to work on the data, generators are probably the best way.
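As a quick illustration of how composable this approach is, one generator can feed another without the whole file ever being in memory at once. The uppercase transform here is just a stand-in for whatever per-line work we actually need to do:

function uppercaseLines($lines) {
    foreach ($lines as $line) {
        yield strtoupper($line);
    }
}

foreach (uppercaseLines(readTheFile("shakespeare.txt")) as $line) {
    // each line flows through both generators, one at a time
}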

Piping Between Files

In situations where we don’t need to operate on the data, we can pass file data from one file to another. This is commonly called piping (presumably because we don’t see what’s inside a pipe except at each end … as long as it's opaque, of course!). We can achieve this by using stream methods. Let’s first write a script to transfer from one file to another, so that we can measure the memory usage:

// from piping-files-1.php

file_put_contents(
    "piping-files-1.txt", file_get_contents("shakespeare.txt")
);

require "memory.php";

Unsurprisingly, this script uses slightly more memory to run than the text file it copies. That’s because it has to read (and keep) the entire file contents in memory until it has written the new file. For small files, that may be okay. When we start to use bigger files, not so much…

Let’s try streaming (or piping) from one file to another:

// from piping-files-2.php

$handle1 = fopen("shakespeare.txt", "r");
$handle2 = fopen("piping-files-2.txt", "w");

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

This code is slightly strange. We open handles to both files, the first in read mode and the second in write mode. Then we copy from the first into the second. We finish by closing both files again. It may surprise you to know that the memory used is 393KB.

That seems familiar. Isn’t that roughly what the generator code stored while reading each line? That’s because the second argument to fgets limits how many bytes of each line to read (if it’s omitted, fgets reads until it reaches the end of the line).

The third argument to stream_copy_to_stream is a similar kind of limit (it defaults to -1, meaning copy everything). stream_copy_to_stream reads from one stream in small chunks and writes them to the other stream. It skips the part where the generator yields a value, since we don’t need to work with that value.
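If we only wanted part of the source, we could use that third argument ourselves. This is a small, hypothetical variation on the script above:

$handle1 = fopen("shakespeare.txt", "r");
$handle2 = fopen("partial-copy.txt", "w");

// copy no more than the first 1024 bytes of the source stream
stream_copy_to_stream($handle1, $handle2, 1024);

fclose($handle1);
fclose($handle2);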

Piping this text isn’t useful to us, so let’s think of other examples which might be. Suppose we wanted to output an image from our CDN, as a sort of redirected application route. We could illustrate it with code resembling the following:

// from piping-files-3.php

file_put_contents(
    "piping-files-3.jpeg", file_get_contents(
        "http://ift.tt/2z7SdL5"
    )
);

// ...or write this straight to stdout, if we don't need the memory info

require "memory.php";

Imagine an application route brought us to this code. But instead of serving up a file from the local file system, we want to get it from a CDN. We might replace file_get_contents with something more elegant (like Guzzle), but under the hood it’s much the same.
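If we did reach for Guzzle (assuming the guzzlehttp/guzzle package is installed), a rough equivalent could use the sink request option to stream the body straight to disk instead of buffering it in a string first:

require "vendor/autoload.php";

$client = new \GuzzleHttp\Client();

// "sink" tells Guzzle to write the response body directly to this file
$client->get("http://ift.tt/2z7SdL5", [
    "sink" => "piping-files-3.jpeg",
]);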

The memory usage (for this image) is around 581KB. Now, how about we try to stream this instead?

// from piping-files-4.php

$handle1 = fopen(
    "http://ift.tt/2z7SdL5", "r"
);

$handle2 = fopen(
    "piping-files-4.jpeg", "w"
);

// ...or write this straight to stdout, if we don't need the memory info

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

The memory usage is slightly less (at 400KB), but the result is the same. If we didn’t need the memory information, we could just as well print to standard output. In fact, PHP provides a simple way to do this:

$handle1 = fopen(
    "http://ift.tt/2z7SdL5", "r"
);

$handle2 = fopen(
    "php://stdout", "w"
);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

// require "memory.php";

Other Streams

There are a few other streams we could pipe and/or write to and/or read from:

  • php://stdin (read-only)
  • php://stderr (write-only, like php://stdout)
  • php://input (read-only) which gives us access to the raw request body
  • php://output (write-only) which lets us write to an output buffer
  • php://memory and php://temp (read-write) are places we can store data temporarily. The difference is that php://temp will store the data in the file system once it becomes large enough, while php://memory will keep storing in memory until that runs out.
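As a small illustration of that last pair, php://temp gives us a scratch stream that stays in memory until it grows past a threshold (around 2MB by default), at which point it spills to a temporary file:

$buffer = fopen("php://temp", "r+");

fwrite($buffer, "some data we want to stage somewhere cheap");
rewind($buffer);

print stream_get_contents($buffer); // reads back what we wrote

fclose($buffer);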

Continue reading How to Read Big Files with PHP (Without Killing Your Server) on SitePoint.


by Christopher Pitt via SitePoint

How to Build a Buyer Persona: A Recipe for Success [Infographic]

Who is your ideal customer? This may seem like a simple question, but the answer is not so simple. Most businesses have a general idea of who their best customers are, but many don't truly understand why those people are their primary customers. They might have bits and pieces of information,...


by Irfan Ahmad via Digital Information World

Introducing a new dedicated template directory: OnePageTemplates.com

OnePageTemplates.com

Finding a quality website template can be a very frustrating experience.

I get emails every day from readers struggling to find the right solution out of the 400 templates listed on the One Page Love website.

I’m making it my new mission to solve this.

Starting with free templates, I’ve soft launched a new dedicated site OnePageTemplates.com – it’s got a long way to go, but it’s blazingly fast and all links are direct downloads.

I’d absolutely love your feedback on the browsing experience and what I can do to improve. :)


by Rob Hope @robhope via One Page Love

6 Free Stock Image, Music and Video Resources for Content Creators

Do you want to enhance the videos and graphics you produce for social media? Looking for free stock photos, video clips, and music sites? In this video you’ll discover 6 websites where content creators can find quality stock imagery and media files.


by Irfan Ahmad via Digital Information World

12 Ways to Turn Stress in to Productivity - #infographic

From the cradle to the grave, life hands us stressful situations — so much so that we should consider stress to be an integral part of living a full life, rather than a nuisance to be avoided at all costs. Rather than ignoring stress, it can be healthier and even more productive to harness the...


by Web Desk via Digital Information World

How to Easily Broadcast Multi-Camera Live Video for Facebook Live

Do you want to improve the quality of your live videos? Wondering how to integrate visuals and work with multiple camera angles? In this article, you’ll discover how to broadcast professional-quality live video to Facebook and YouTube. Why Brand Live Videos? Switcher Studio is a mobile production tool that lets you create professionally branded Facebook [...]

This post How to Easily Broadcast Multi-Camera Live Video for Facebook Live first appeared on Social Media Examiner - Your Guide to the Social Media Jungle.


by Erin Cell via Social Media Examiner

Wind And Words

Using data collected from subtitles and other sources, this experiment analyzes and visualizes character interactions from the first six seasons of HBO’s Game of Thrones.
via Awwwards - Sites of the day