Wednesday, July 1, 2015

Crawling and Searching Entire Domains with Diffbot

In this tutorial, I’ll show you how to build a custom SitePoint search engine that far outdoes anything WordPress could ever put out. We’ll be using Diffbot as a service to extract structured data from SitePoint automatically, and its matching API client to do both the crawling and the searching.


I’ll also be using my trusty Homestead Improved environment for a clean project, so I can experiment in a VM that’s dedicated to this project and this project alone.

What’s what?

To make a SitePoint search engine, we need to do the following:

  1. Build a Crawljob which will index and process the entire SitePoint.com domain and keep itself up to date with newly published content.
  2. Build a GUI for submitting search queries to the saved set produced by this crawljob. Searching is done via the Search API. We’ll do this in a follow-up post.

A Diffbot Crawljob does the following:

  1. It spiders a URL pattern for URLs. This does not mean processing - it means looking for links to process on all the pages it can find, starting from the domain you originally passed in as seed. For the difference between crawling and processing, see here.
  2. It processes the pages found on the spidered URLs with the designated API engine - for example, using the Product API, it processes all the products it finds on Amazon.com and saves them into a structured database of items on offer.
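
To make the two phases concrete, here is a minimal toy sketch in Python. It is not Diffbot's implementation and does not call the Diffbot API: the in-memory `SITE` map, the `run_crawljob` function, the two regex patterns and the `fake_article_api` stand-in are all hypothetical names made up for illustration. It only shows the idea that spidering follows links matching one pattern, while processing applies an extraction step to the subset of spidered pages matching another.

```python
from collections import deque
import re

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "https://example.com/": ["https://example.com/blog/", "https://example.com/about"],
    "https://example.com/blog/": ["https://example.com/blog/post-1", "https://example.com/blog/post-2"],
    "https://example.com/blog/post-1": ["https://example.com/blog/post-2", "https://other.example.org/"],
    "https://example.com/blog/post-2": [],
    "https://example.com/about": [],
}


def run_crawljob(seed, crawl_pattern, process_pattern, process):
    """Spider outward from `seed`, processing only pages that match `process_pattern`."""
    seen = {seed}          # every URL the spider has discovered
    queue = deque([seed])  # URLs still waiting to be visited
    results = {}           # structured output of the processing step

    while queue:
        url = queue.popleft()

        # Processing: hand matching pages to the extraction engine.
        if re.search(process_pattern, url):
            results[url] = process(url)

        # Spidering: collect links to visit later, without extracting anything.
        for link in SITE.get(url, []):
            if link not in seen and re.search(crawl_pattern, link):
                seen.add(link)
                queue.append(link)

    return seen, results


def fake_article_api(url):
    """Stand-in for an extraction API; returns a made-up structured record."""
    return {"url": url, "type": "article"}


spidered, processed = run_crawljob(
    seed="https://example.com/",
    crawl_pattern=r"^https://example\.com/",  # stay on the seed domain
    process_pattern=r"/blog/post-",           # only extract actual posts
    process=fake_article_api,
)

print(len(spidered), "URLs spidered;", len(processed), "pages processed")
```

In this toy run every on-domain page gets spidered, but only the two post pages are processed, which is the distinction the list above draws.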

by Bruno Skvorc via SitePoint
