Tuesday, September 27, 2016

Build a Search Engine with Node.js and Elasticsearch

Elasticsearch is an open source search engine, which is gaining popularity due to its high performance and distributed architecture. In this article, I will discuss its key features and walk you through the process of using it to create a Node.js search engine.

Introduction to Elasticsearch

Elasticsearch is built on top of Apache Lucene, which is a high-performance text search engine library. Although Elasticsearch can perform the storage and retrieval of data, its main purpose is not to serve as a database; rather, it is a search engine (server) whose main goals are indexing, searching, and providing real-time statistics on the data.

Elasticsearch has a distributed architecture that allows it to scale horizontally by adding more nodes and taking advantage of the extra hardware. It supports thousands of nodes for processing petabytes of data. Horizontal scaling also gives it high availability: if any nodes fail, the data is rebalanced across the remaining ones.

When data is imported, it immediately becomes available for searching. Elasticsearch is schema-free, stores data in JSON documents, and can automatically detect the data structure and type.

Elasticsearch is also completely API driven. This means that almost any operation can be performed via a simple RESTful API using JSON data over HTTP. It has client libraries for almost every programming language, including Node.js. In this tutorial we will use the official client library.

Elasticsearch is very flexible when it comes to hardware and software requirements. Although the recommended production setting is 64GB memory and as many CPU cores as possible, you can still run it on a resource-constrained system and get decent performance (assuming your data set is not huge). For following the examples in this article, a system with 2GB memory and a single CPU core will suffice.

You can run Elasticsearch on all major operating systems (Linux, Mac OS, and Windows). To do so, you need the latest version of the Java Runtime Environment installed (see the Installing Elasticsearch section). To follow the examples in this article, you'll also need to have Node.js installed (any version after v0.11.0 will do), as well as npm.

Elasticsearch terminology

Elasticsearch uses its own terminology, which in some cases differs from that of typical database systems. Below is a list of common terms in Elasticsearch and their meanings.

Index: This term has two meanings in the Elasticsearch context. The first is the operation of adding data. When data is added, the text is broken down into tokens (e.g. words) and every token is indexed. However, an index also refers to where all the indexed data is stored. Basically, when you import data, it is indexed into an index. Every time you want to perform any operation on data, you need to specify its index name.
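As a rough illustration of the indexing operation, here is a toy sketch of how text can be broken into tokens and stored in an inverted index. This is only an illustration, not Elasticsearch's actual analysis pipeline, which also handles stemming, stop words, scoring, and much more:

```javascript
// A toy inverted index: maps each token to the IDs of the documents
// that contain it. Purely illustrative.
function buildInvertedIndex(docs) {
  const index = new Map();
  docs.forEach((text, docId) => {
    // Naive tokenization: lowercase, split on non-word characters
    const tokens = text.toLowerCase().split(/\W+/).filter(Boolean);
    for (const token of tokens) {
      if (!index.has(token)) {
        index.set(token, new Set());
      }
      index.get(token).add(docId);
    }
  });
  return index;
}

const docs = ['Node.js search engine', 'Search with Elasticsearch'];
const index = buildInvertedIndex(docs);
console.log([...index.get('search')]); // → [ 0, 1 ]
```

Looking up a token is now a single map access, which is why searches over indexed data are fast: the expensive work was done once, at indexing time.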

Type: Elasticsearch provides a more detailed categorization of documents within an index, which is called type. Every document in an index should also have a type. For example, we can define a library index, then index multiple types of data such as article, book, report, and presentation into it. Since indices have an almost fixed overhead, it is recommended to have fewer indices and more types, rather than more indices and fewer types.

Search: This term means what you might think. You can search data in different indices and types. Elasticsearch provides many types of search queries such as term, phrase, range, fuzzy, and even queries for geo data.

Filter: Elasticsearch allows you to filter search results based on different criteria, to further narrow down the results. If you add a new search query to a set of documents, it might change their order based on relevancy, but if you apply the same query as a filter, the order remains unchanged.
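As a sketch of the difference, here is a bool query body combining both: the match clause scores documents by relevance, while the filter clause only narrows the result set without affecting scores. The field names (title, year) are taken from the article dataset used later in this tutorial:

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "search engine" } }
      ],
      "filter": [
        { "range": { "year": { "gte": 2012 } } }
      ]
    }
  }
}
```

Because filter clauses do not contribute to scoring, Elasticsearch can also cache them, which often makes repeated filtered queries faster.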

Aggregations: These provide you with different types of statistics on aggregated data, such as minimum, maximum, average, summation, histograms, and so on.
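To give a flavor of what an aggregation request looks like, here is a sketch of a request body computing two statistics over the year field of the dataset used later in this tutorial (setting size to 0 skips returning the matching documents themselves):

```json
{
  "size": 0,
  "aggs": {
    "avg_year": {
      "avg": { "field": "year" }
    },
    "articles_per_year": {
      "histogram": { "field": "year", "interval": 1 }
    }
  }
}
```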

Suggestions: Elasticsearch provides different types of suggestions for input text. These suggestions could be term or phrase based, or even completion suggestions.

Installing Elasticsearch

Elasticsearch is available under the Apache 2 license; it can be downloaded, used, and modified free of charge. Before installing it, you need to make sure you have the Java Runtime Environment (JRE) installed on your computer. Elasticsearch is written in Java and relies on Java libraries to run. To check whether you have Java installed on your system, you can type the following on the command line:

java -version

Using the latest stable version of Java is recommended (1.8 at the time of writing this article). You can find a guide for installing Java on your system here.

Next, to download the latest version of Elasticsearch (2.3.5 at the time of writing this article), go to the download page and download the ZIP file. Elasticsearch requires no installation and the single zip file contains the complete set of files to run the program on all of the supported operating systems. Unzip the downloaded file and you are done! There are several other ways to get Elasticsearch running, such as getting the TAR file or packages for different Linux distributions (look here).

If you are running Mac OS X and you have Homebrew installed, you can install Elasticsearch using brew install elasticsearch. Homebrew automatically adds the executables to your path and installs the required services. It also helps you update the application with a single command: brew upgrade elasticsearch.

To run Elasticsearch on Windows, from the unzipped directory, run bin\elasticsearch.bat from the command line. For every other OS, run ./bin/elasticsearch from the terminal. At this point it should be running on your system.

As I mentioned earlier, almost all operations you can do with Elasticsearch can be done via RESTful APIs. Elasticsearch uses port 9200 by default. To make sure it is running correctly, head to http://localhost:9200/ in your browser; it should display some basic information about your running instance.

For further reading about installation and troubleshooting, you can visit the documentation.

Graphical User Interface

Elasticsearch provides almost all its functionality through REST APIs and does not ship with a graphical user interface (GUI). While I cover how you can perform all the necessary operations through APIs and Node.js, there are several GUI tools that provide visual information about indices and data, and even some high level analytics.

Kibana, which is developed by the same company, provides a real-time summary of the data, plus several customized visualization and analytics options. Kibana is free and has detailed documentation.

There are other tools developed by the community, including elasticsearch-head, Elasticsearch GUI, and even a Chrome extension called ElasticSearch Toolbox. These tools help you explore your indices and data in the browser, and even try out different search and aggregation queries. All these tools provide a walkthrough for installation and use.

Setting Up a Node.js Environment

Elasticsearch provides an official module for Node.js, called elasticsearch. First, you need to add the module to your project folder, and save the dependency for future use.

npm install elasticsearch --save

Then, you can import the module in your script as follows:

const elasticsearch = require('elasticsearch');

Finally, you need to set up the client that handles the communication with Elasticsearch. In this case, I assume you are running Elasticsearch on your local machine with an IP address of 127.0.0.1 and the port 9200 (default setting).

const esClient = new elasticsearch.Client({
  host: '127.0.0.1:9200',
  log: 'error'
});

The log option ensures that all errors are logged. In the rest of this article, I will use the same esClient object to communicate with Elasticsearch. The complete documentation for the Node module is provided here.

Note: all of the source code for this tutorial is provided on GitHub. The easiest way to follow along is to clone the repo to your PC and run the examples from there:

git clone https://github.com/sitepoint-editors/node-elasticsearch-tutorial.git
cd node-elasticsearch-tutorial
npm install

Importing the Data

Throughout this tutorial, I will use an academic articles dataset with randomly generated content. The data is provided in JSON format, and there are 1000 articles in the dataset. To show what the data looks like, one item from the dataset is shown below.

{
    "_id": "57508457f482c3a68c0a8ab3",
    "title": "Nostrud anim proident cillum non.",
    "journal": "qui ea",
    "volume": 54,
    "number": 11,
    "pages": "109-117",
    "year": 2014,
    "authors": [
      {
        "firstname": "Allyson",
        "lastname": "Ellison",
        "institution": "Ronbert",
        "email": "Allyson@Ronbert.tv"
      },
      ...
    ],
    "abstract": "Do occaecat reprehenderit dolore ...",
    "link": "http://ift.tt/2di0OLV",
    "keywords": [
      "sunt",
      "fugiat",
      ...
    ],
    "body": "removed to save space"
  }

The field names are self-explanatory. The only point to note is that the body field is not displayed here, since it contains a complete, randomly generated article (with between 100 and 200 paragraphs). You can find the complete data set here.

While Elasticsearch provides methods for indexing, updating, and deleting single data points, we're going to make use of Elasticsearch's bulk method to import the data, which is used to perform operations on large data sets in a more efficient manner:

// index.js

const fs = require('fs');

const bulkIndex = function bulkIndex(index, type, data) {
  let bulkBody = [];

  data.forEach(item => {
    // First entry: the action descriptor for this document
    bulkBody.push({
      index: {
        _index: index,
        _type: type,
        _id: item._id
      }
    });

    // Second entry: the document itself
    bulkBody.push(item);
  });

  esClient.bulk({body: bulkBody})
  .then(response => {
    let errorCount = 0;
    response.items.forEach(item => {
      if (item.index && item.index.error) {
        console.log(++errorCount, item.index.error);
      }
    });
    console.log(
      `Successfully indexed ${data.length - errorCount}
       out of ${data.length} items`
    );
  })
  .catch(console.error);
};

const test = function test() {
  const articlesRaw = fs.readFileSync('data.json');
  const articles = JSON.parse(articlesRaw.toString());
  console.log(`${articles.length} items parsed from data file`);
  bulkIndex('library', 'article', articles);
};

test();

Here, we are calling the bulkIndex function passing it library as the index name, article as the type and the JSON data we wish to have indexed. The bulkIndex function in turn calls the bulk method on the esClient object. This method takes an object with a body property as an argument. The value supplied to the body property is an array with two entries for each operation. In the first entry, the type of the operation is specified as a JSON object. Within this object, the index property determines the operation to be performed (indexing a document in this case), as well as the index name, type name, and the document ID. The next entry corresponds to the document itself.

Note that in the future, you might add other types of documents (such as books or reports) to the same index in this way. We could also assign a unique ID to each document, but this is optional: if you do not provide one, Elasticsearch will assign a randomly generated unique ID to each document for you.
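To make the interleaved structure easier to see in isolation, the body-building step can be factored into a small pure helper. The buildBulkBody name is my own, not something from the tutorial's repository:

```javascript
// Builds the interleaved action/document array expected by the bulk API:
// one action descriptor followed by its document, for every item.
function buildBulkBody(index, type, data) {
  const bulkBody = [];
  for (const item of data) {
    bulkBody.push({
      index: { _index: index, _type: type, _id: item._id }
    });
    bulkBody.push(item);
  }
  return bulkBody;
}

const body = buildBulkBody('library', 'article', [
  { _id: '1', title: 'First article' },
  { _id: '2', title: 'Second article' }
]);
console.log(body.length); // → 4 (two action lines plus two documents)
```

Keeping this step pure makes it easy to inspect or unit-test the request body before anything is sent to the cluster.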

Assuming you have cloned the repository, you can now import the data into Elasticsearch by executing the following command from the project root:

$ node index.js
1000 items parsed from data file
Successfully indexed 1000 out of 1000 items

Checking the data was indexed correctly

One of the great features of Elasticsearch is near real-time search. This means that once documents are indexed, they become available for search within one second (see here). Once the data is indexed, you can check the index information by running indices.js (link to source):

// indices.js

const indices = function indices() {
  return esClient.cat.indices({v: true})
  .then(console.log)
  .catch(err => console.error(`Error connecting to the es client: ${err}`));
};

Methods in the client's cat object provide different information about the currently running instance. The indices method lists all the indices, their health status, the number of documents they contain, and their size on disk. The v option adds a header to the response from the cat methods.

When you run the above snippet, you will notice it outputs a color code to indicate the health status of your cluster. Red indicates something is wrong with your cluster and it is not running. Yellow means the cluster is running, but there is a warning, and green means everything is working fine. Most likely (depending on your setup) you will get a yellow status when running on your local machine. This is because, by default, an index is created with five shards and one replica per shard, but with only a single instance running on your local machine, the replica shards cannot be allocated. While you should always aim for green status in a production environment, for the purpose of this tutorial you can continue to use Elasticsearch in yellow status.

$ node indices.js
elasticsearch indices information:
health status index   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   library   5   1       1000            0     41.2mb         41.2mb



by Behrooz Kamali via SitePoint
