Friday, August 25, 2017

ReactJS in PHP: Writing Compilers Is Easy and Fun!

I used to use an extension called XHP. It enables HTML-in-PHP syntax for generating front-end markup. I reached for it recently, and was surprised to find that it was no longer officially supported for modern PHP versions.

So, I decided to implement a user-land version of it, using a basic state-machine compiler. It seemed like it would be a fun project to do with you!

The code for this tutorial can be found on Github.

Abstract image of blocks coming together

Creating Compilers

Many developers avoid writing their own compilers or interpreters, thinking that the topic is too complex or difficult to explore properly. I used to feel like that too. Compilers can be difficult to make well, and the topic can be incredibly complex and difficult. But, that doesn’t mean you can’t make a compiler.

Making a compiler is like making a sandwich. Anyone can get the ingredients and put it together. You can make a sandwich. You can also go to chef school and learn how to make the best damn sandwich the world has ever seen. You can study the art of sandwich making for years, and people can talk about your sandwiches in other lands. You’re not going to let the breadth and complexity of sandwich-making prevent you from making your first sandwich, are you?

Compilers (and interpreters) begin with humble string manipulation and temporary variables. When they’re sufficiently popular (or sufficiently slow) then the experts can step in; to replace the string manipulation and temporary variables with unicorn tears and cynicism.

At a fundamental level, compilers take a string of code and run it through a couple of steps:

  1. The code is split into tokens – meaningful characters and sub-strings – which the compiler will use to derive meaning. The statement if (isEmergency) alert("there is an emergency") could be considered to contain tokens like if, isEmergency, alert, and "there is an emergency"; and these all mean something to the compiler.

    The first step is to split the entire source code up into these meaningful bits, so that the compiler can start to organize them in a logical hierarchy, so it knows what to do with the code.

  2. The tokens are arranged into the logical hierarchy (sometimes called an Abstract Syntax Tree) which represents what needs to be done in the program. The previous statement could be understood as “Work out if the condition (isEmergency) evaluates to true. If it does, run the function (alert) with the parameter ("there is an emergency")”.

Using this hierarchy, the code can be immediately executed (in the case of an interpreter or virtual machine) or translated into other languages (in the case of languages like CoffeeScript and TypeScript, which are both compile-to-Javascript languages).

In our case, we want to maintain most of the PHP syntax, but we also want to add our own little bit of syntax on top. We could create a whole new interpreter…or we could preprocess the new syntax, compiling it to syntactically valid PHP code.

I’ve written about preprocessing PHP before, and it’s my favorite approach to adding new syntax. In this case, we need to write a more complex script; so we’re going to deviate from how we’ve previously added new syntax.

Generating Tokens

Let’s create a function to split code into tokens. It begins like this:

function tokens($code) {
    $tokens = [];

    $length = strlen($code);
    $cursor = 0;

    while ($cursor < $length) {
        if ($code[$cursor] === "{") {
            print "ATTRIBUTE STARTED ({$cursor})" . PHP_EOL;
        }

        if ($code[$cursor] === "}") {
            print "ATTRIBUTE ENDED ({$cursor})" . PHP_EOL;
        }

        if ($code[$cursor] === "<") {
            print "ELEMENT STARTED ({$cursor})" . PHP_EOL;
        }

        if ($code[$cursor] === ">") {
            print "ELEMENT ENDED ({$cursor})" . PHP_EOL;
        }

        $cursor++;
    }
}

$code = '
    <?php

    $classNames = "foo bar";
    $message = "hello world";

    $thing = (
        <div
            className={() => { return "outer-div"; }}
            nested={<span className={"nested-span"}>with text</span>}
        >
            a bit of text before
            <span>
                {$message} with a bit of extra text
            </span>
            a bit of text after
        </div>
    );
';

tokens($code);

// ELEMENT STARTED (5)
// ELEMENT STARTED (95)
// ATTRIBUTE STARTED (122)
// ELEMENT ENDED (127)
// ATTRIBUTE STARTED (129)
// ATTRIBUTE ENDED (151)
// ATTRIBUTE ENDED (152)
// ATTRIBUTE STARTED (173)
// ELEMENT STARTED (174)
// ATTRIBUTE STARTED (190)
// ATTRIBUTE ENDED (204)
// ELEMENT ENDED (205)
// ELEMENT STARTED (215)
// ELEMENT ENDED (221)
// ATTRIBUTE ENDED (222)
// ELEMENT ENDED (232)
// ELEMENT STARTED (279)
// ELEMENT ENDED (284)
// ATTRIBUTE STARTED (302)
// ATTRIBUTE ENDED (311)
// ELEMENT STARTED (350)
// ELEMENT ENDED (356)
// ELEMENT STARTED (398)
// ELEMENT ENDED (403)

This is from tokens-1.php

We’re off to a good start. By stepping through the code, we can check to see what each character is (and identify the ones that matter to us). We’re seeing, for instance, that the first element opens when we encounter a < character, at index 5. The first element closes at index 210.

Unfortunately, that first opening is being incorrectly matched to <?php. That’s not an element in our new syntax, so we have to stop the code from picking it out:

preg_match("#^</?[a-zA-Z]#", substr($code, $cursor, 3), $matchesStart);

if (count($matchesStart)) {
    print "ELEMENT STARTED ({$cursor})" . PHP_EOL;
}

// ...

// ELEMENT STARTED (95)
// ATTRIBUTE STARTED (122)
// ELEMENT ENDED (127)
// ATTRIBUTE STARTED (129)
// ATTRIBUTE ENDED (151)
// ATTRIBUTE ENDED (152)
// ATTRIBUTE STARTED (173)
// ELEMENT STARTED (174)
// ...

This is from tokens-2.php

Instead of checking only the current character, our new code checks three characters: if they match the pattern <div or </div, but not <?php or $num1 < $num2.

There’s another problem: our example uses arrow function syntax, so => is being matched as an element closing sequence. Let’s refine how we match element closing sequences:

preg_match("#^=>#", substr($code, $cursor - 1, 2), $matchesEqualBefore);
preg_match("#^>=#", substr($code, $cursor, 2), $matchesEqualAfter);

if ($code[$cursor] === ">" && !$matchesEqualBefore && !$matchesEqualAfter) {
    print "ELEMENT ENDED ({$cursor})" . PHP_EOL;
}

// ...

// ELEMENT STARTED (95)
// ATTRIBUTE STARTED (122)
// ATTRIBUTE STARTED (129)
// ATTRIBUTE ENDED (151)
// ATTRIBUTE ENDED (152)
// ATTRIBUTE STARTED (173)
// ELEMENT STARTED (174)
// ...

This is from tokens-3.php

As with JSX, it would be good for attributes to allow dynamic values (even if those values are nested JSX elements). There are a few ways we could do this, but the one I prefer is to treat all attributes as text, and tokenize them recursively. To do this, we need to have a kind of state machine which tracks how many levels deep we are in an element and attribute. If we’re inside an element tag, we should trap the top level {…} as a string attribute value, and ignore subsequent braces. Similarly, if we’re inside an attribute, we should ignore nested element opening and closing sequences:

function tokens($code) {
    $tokens = [];

    $length = strlen($code);
    $cursor = 0;

    $elementLevel = 0;
    $elementStarted = null;
    $elementEnded = null;

    $attributes = [];
    $attributeLevel = 0;
    $attributeStarted = null;
    $attributeEnded = null;

    while ($cursor < $length) {
        $extract = trim(substr($code, $cursor, 5)) . "...";

        if ($code[$cursor] === "{" && $elementStarted !== null) {
            if ($attributeLevel === 0) {
                print "ATTRIBUTE STARTED ({$cursor}, {$extract})" . PHP_EOL;
                $attributeStarted = $cursor;
            }

            $attributeLevel++;
        }

        if ($code[$cursor] === "}" && $elementStarted !== null) {
            $attributeLevel--;

            if ($attributeLevel === 0) {
                print "ATTRIBUTE ENDED ({$cursor})" . PHP_EOL;
                $attributeEnded = $cursor;
            }
        }

        preg_match("#^</?[a-zA-Z]#", substr($code, $cursor, 3), $matchesStart);

        if (count($matchesStart) && $attributeLevel < 1) {
            print "ELEMENT STARTED ({$cursor}, {$extract})" . PHP_EOL;

            $elementLevel++;
            $elementStarted = $cursor;
        }

        preg_match("#^=>#", substr($code, $cursor - 1, 2), $matchesEqualBefore);
        preg_match("#^>=#", substr($code, $cursor, 2), $matchesEqualAfter);

        if (
            $code[$cursor] === ">"
            && !$matchesEqualBefore && !$matchesEqualAfter
            && $attributeLevel < 1
        ) {
            print "ELEMENT ENDED ({$cursor})" . PHP_EOL;

            $elementLevel--;
            $elementEnded = $cursor;
        }

        if ($elementStarted && $elementEnded) {
            // TODO

            $elementStarted = null;
            $elementEnded = null;
        }

        $cursor++;
    }
}

// ...

// ELEMENT STARTED (95, <div...)
// ATTRIBUTE STARTED (122, {() =...)
// ATTRIBUTE ENDED (152)
// ATTRIBUTE STARTED (173, {<spa...)
// ATTRIBUTE ENDED (222)
// ELEMENT ENDED (232)
// ELEMENT STARTED (279, <span...)
// ELEMENT ENDED (284)
// ELEMENT STARTED (350, </spa...)
// ELEMENT ENDED (356)
// ELEMENT STARTED (398, </div...)
// ELEMENT ENDED (403)

This is from tokens-4.php

We’ve added new $attributeLevel, $attributeStarted, and $attributeEnded variables; to track how deep we are in the nesting of attributes, and where the top-level starts and ends. Specifically, if we’re at the top level when an attribute’s value starts or ends, we capture the current cursor position. Later, we’ll use this to extract the string attribute value and replace it with a placeholder.

We’re also starting to capture $elementStarted and $elementEnded (with $elementLevel fulfilling a similar role to $attributeLevel) so that we can capture a full element opening or closing tag. In this case, $elementEnded doesn’t refer to the closing tag but rather the closing sequence of characters of the opening tag. Closing tags are treated as entirely separate tokens…

After extracting a small substring after the current cursor position, we can see elements and attributes starting and ending exactly where we expect. The nested control structures and elements are captured as strings, leaving only the top-level elements, non-attribute nested elements, and attribute values.

Let’s package these tokens up, associating attributes with the tags in which they are defined:

function tokens($code) {
    $tokens = [];

    $length = strlen($code);
    $cursor = 0;

    $elementLevel = 0;
    $elementStarted = null;
    $elementEnded = null;

    $attributes = [];
    $attributeLevel = 0;
    $attributeStarted = null;
    $attributeEnded = null;

    $carry = 0;

    while ($cursor < $length) {
        if ($code[$cursor] === "{" && $elementStarted !== null) {
            if ($attributeLevel === 0) {
                $attributeStarted = $cursor;
            }

            $attributeLevel++;
        }

        if ($code[$cursor] === "}" && $elementStarted !== null) {
            $attributeLevel--;

            if ($attributeLevel === 0) {
                $attributeEnded = $cursor;
            }
        }

        if ($attributeStarted && $attributeEnded) {
            $position = (string) count($attributes);
            $positionLength = strlen($position);

            $attribute = substr(
                $code, $attributeStarted + 1, $attributeEnded - $attributeStarted - 1
            );

            $attributes[$position] = $attribute;

            $before = substr($code, 0, $attributeStarted + 1);
            $after = substr($code, $attributeEnded);

            $code = $before . $position . $after;

            $cursor = $attributeStarted + $positionLength + 2 /* curlies */;
            $length = strlen($code);

            $attributeStarted = null;
            $attributeEnded = null;

            continue;
        }

        preg_match("#^</?[a-zA-Z]#", substr($code, $cursor, 3), $matchesStart);

        if (count($matchesStart) && $attributeLevel < 1) {
            $elementLevel++;
            $elementStarted = $cursor;
        }

        preg_match("#^=>#", substr($code, $cursor - 1, 2), $matchesEqualBefore);
        preg_match("#^>=#", substr($code, $cursor, 2), $matchesEqualAfter);

        if (
            $code[$cursor] === ">"
            && !$matchesEqualBefore && !$matchesEqualAfter
            && $attributeLevel < 1
        ) {
            $elementLevel--;
            $elementEnded = $cursor;
        }

        if ($elementStarted !== null && $elementEnded !== null) {
            $distance = $elementEnded - $elementStarted;

            $carry += $cursor;

            $before = trim(substr($code, 0, $elementStarted));
            $tag = trim(substr($code, $elementStarted, $distance + 1));
            $after = trim(substr($code, $elementEnded + 1));

            $token = ["tag" => $tag, "started" => $carry];

            if (count($attributes)) {
                $token["attributes"] = $attributes;
            }

            $tokens[] = $before;
            $tokens[] = $token;

            $attributes = [];

            $code = $after;
            $length = strlen($code);
            $cursor = 0;

            $elementStarted = null;
            $elementEnded = null;

            continue;
        }

        $cursor++;
    }

    return $tokens;
}

$code = '
    <?php

    $classNames = "foo bar";
    $message = "hello world";

    $thing = (
        <div
            className={() => { return "outer-div"; }}
            nested={<span className={"nested-span"}>with text</span>}
        >
            a bit of text before
            <span>
                {$message} with a bit of extra text
            </span>
            a bit of text after
        </div>
    );
';

tokens($code);

// Array
// (
//     [0] => <?php
//
//     $classNames = "foo bar";
//     $message = "hello world";
//
//     $thing = (
//     [1] => Array
//         (
//             [tag] => <div className={0} nested={1}>
//             [started] => 157
//             [attributes] => Array
//                 (
//                     [0] => () => { return "outer-div"; }
//                     [1] => <span className={"nested-span"}>with text</span>
//                 )
//
//         )
//
//     [2] => a bit of text before
//     [3] => Array
//         (
//             [tag] => <span>
//             [started] => 195
//         )
//
//     [4] => {$message} with a bit of extra text
//     [5] => Array
//         (
//             [tag] => </span>
//             [started] => 249
//         )
//
//     [6] => a bit of text after
//     [7] => Array
//         (
//             [tag] => </div>
//             [started] => 282
//         )
//
// )

This is from tokens-5.php

There’s a lot going on here, but it’s all just a natural progression from the previous version. We use the captured attribute start and end positions to extract the entire attribute value as one big string. We then replace each captured attribute with a numeric placeholder and reset the code string and cursor positions.

As each element closes, we associate all the attributes since the element was opened, and create a separate array token from the tag (with its placeholders), attributes and starting position. The result may be a little harder to read, but it is spot on in terms of capturing the intent of the code.

So, what do we do about those nested element attributes?

Continue reading %ReactJS in PHP: Writing Compilers Is Easy and Fun!%


by Christopher Pitt via SitePoint

No comments:

Post a Comment