Mr Branding: Web Crawling with Node, PhantomJS and Horseman

Monday, February 22, 2016

Web Crawling with Node, PhantomJS and Horseman

It is quite common during the course of a project to find yourself needing to write custom scripts for performing a variety of actions. Such one-off scripts, which are typically executed via the command-line (CLI), can be used for virtually any type of task. Having written many such scripts over the years, I have grown to appreciate the value of taking a small amount of time upfront to put in place a custom CLI microframework to facilitate this process. Fortunately, Node.js and its extensive package ecosystem, npm, make it easy to do just that. Whether parsing a text file or running an ETL, having a convention in place makes it easy to add new functionality in an efficient and structured way.

[author_more]

While not necessarily associated with the command-line, web crawling is often employed in certain problem domains like automated functional testing and defacement detection. This tutorial demonstrates how to implement a lightweight CLI framework whose supported actions revolve around web crawling. Hopefully, this will get your creative juices flowing, whether your interest be specific to crawling or to the command-line. Technologies covered include Node.js, PhantomJS, and an assortment of npm packages related to both crawling and the CLI.

The source code for this tutorial can be found on GitHub. In order to run the examples, you will need to have both Node.js and PhantomJS installed. Instructions for downloading and installing them can be found here: Node.js, PhantomJS.

Setting up a Basic Command-Line Framework

At the heart of any CLI framework is the concept of converting a command, which typically includes one or more optional or required arguments, into a concrete action. Two npm packages that are quite helpful in this regard are commander and prompt.

Commander allows you to define which arguments are supported, while prompt allows you to (appropriately enough) prompt the user for input at runtime. The end result is a syntactically sweet interface for performing a variety of actions with dynamic behaviors based on some user-supplied data.

Say, for example, we want our command to look like this:

$ node run.js -x hello_world

Our entry point (run.js) defines the possible arguments like this:

program
  .version('1.0.0')
  .option('-x --action-to-perform [string]', 'The type of action to perform.')
  .option('-u --url [string]', 'Optional URL used by certain actions')
  .parse(process.argv);

and defines the various user input cases like this:

var performAction = require('./actions/' + program.actionToPerform)

switch (program.actionToPerform) {
  case 'hello_world':
    prompt.get([{

      // What the property name should be in the result object
      name: 'url',

      // The prompt message shown to the user
      description: 'Enter a URL',

      // Whether or not the user is required to enter a value
      required: true,

      // Validates the user input
      conform: function (value) {

        // In this case, the user must enter a valid URL
        return validUrl.isWebUri(value);
      }
    }], function (err, result) {

      // Perform some action following successful input
      performAction(phantomInstance, result.url);
    });
    break;
}

At this point, we've defined a basic path through which we can specify an action to perform, and have added a prompt to accept a URL. We just need to add a module to handle the logic that is specific to this action. We can do this by adding a file named hello_world.js to the actions directory:

'use strict';

/**
 * @param Horseman phantomInstance
 * @param string url
 */
module.exports = function (phantomInstance, url) {

  if (!url || typeof url !== 'string') {
    throw 'You must specify a url to ping';
  } else {
    console.log('Pinging url: ', url);
  }

  phantomInstance
    .open(url)
    .status()
    .then(function (statusCode) {
      if (Number(statusCode) >= 400) {
        throw 'Page failed with status: ' + statusCode;
      } else {
        console.log('Hello world. Status code returned: ', statusCode);
      }
    })
    .catch(function (err) {
      console.log('Error: ', err);
    })

    // Always close the Horseman instance
    // Otherwise you might end up with orphaned phantom processes
    .close();
};

As you can see, the module expects to be supplied with an instance of a PhantomJS object (phantomInstance) and a URL (url). We will get into the specifics of defining a PhantomJS instance momentarily, but for now it's enough to see that we've laid the groundwork for triggering a particular action. Now that we've put a convention in place, we can easily add new actions in a defined and sane manner.

Crawling with PhantomJS Using Horseman

Horseman is a Node.js package that provides a powerful interface for creating and interacting with PhantomJS processes. A comprehensive explanation of Horseman and its features would warrant its own article, but suffice to say that it allows you to easily simulate just about any behavior that a human user might exhibit in their browser. Horseman provides a wide range of configuration options, including things like automatically injecting jQuery and ignoring SSL certificate warnings. It also provides features for cookie handling and taking screenshots.

Each time we trigger an action through our CLI framework, our entry script (run.js) instantiates an instance of Horseman and passes it along to the specified action module. In pseudo-code it looks something like this:

var phantomInstance = new Horseman({
  phantomPath: '/usr/local/bin/phantomjs',
  loadImages: true,
  injectJquery: true,
  webSecurity: true,
  ignoreSSLErrors: true
});

performAction(phantomInstance, ...);

Now when we run our command, the Horseman instance and input URL get passed to the hello_world module, causing PhantomJS to request the URL, capture its status code, and print the status to the console. We've just run our first bona fide crawl using Horseman. Giddyup!

Output of running hello_world command

Chaining Horseman Methods for Complex Interactions

So far we've looked at a very simple usage of Horseman, but the package can do much more when we chain its methods together to perform a sequence of actions in the browser. In order to demonstrate a few of these features, let's define an action that simulates a user navigating through GitHub to create a new repository.

Please note: This example is purely for demonstration purposes and should not be considered a viable method for creating Github repositories. It is merely an example of how one could use Horseman to interact with a web application. You should use the official Github API if you are interested in creating repositories in an automated fashion.

Let us suppose that the new crawl will be triggered like so:

$ node run.js -x create_repo

Continue reading %Web Crawling with Node, PhantomJS and Horseman%

by Andrew Forth via SitePoint

Mr Branding

Monday, February 22, 2016

Web Crawling with Node, PhantomJS and Horseman

Setting up a Basic Command-Line Framework

Crawling with PhantomJS Using Horseman

Chaining Horseman Methods for Complex Interactions

No comments:

Post a Comment