Optical Character Recognition (OCR) is the process of converting printed text into a digital representation. It has all sorts of practical applications, from digitizing printed books and creating electronic records of receipts to number-plate recognition and even circumventing image-based CAPTCHAs.
Tesseract is an open source program for performing OCR. It runs on *nix systems, Mac OS X and Windows, and with the help of a library we can also use it from PHP applications. This tutorial is designed to show you how.
Installation
Preparation
To keep things simple and consistent, we’ll use a Virtual Machine to run the application, which we’ll provision using Vagrant. This will take care of installing PHP and Nginx, though we’ll install Tesseract separately to demonstrate the process.
If you want to install Tesseract on your own existing Debian-based system, you can skip this next part. Alternatively, visit the README for installation instructions on other *nix systems, Mac OS X (hint: use MacPorts!) or Windows.
Vagrant Setup
To set up Vagrant so that you can follow along with the tutorial, complete the following steps. Alternatively, you can simply grab the code from GitHub.
Enter the following command to download the Homestead Improved Vagrant configuration to a directory named ocr:
git clone https://github.com/Swader/homestead_improved ocr
We’re not going to be using Laravel, so change the Nginx configuration in Homestead.yml from:
sites:
    - map: homestead.app
      to: /home/vagrant/Code/Laravel/public
…to…
sites:
    - map: homestead.app
      to: /home/vagrant/Code/public
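If you would rather script that change than edit the file by hand, something like the following should do it. This is only a minimal sketch: the function name retarget_site is hypothetical (it is not part of Homestead), and it assumes the path appears in your configuration file exactly as shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Homestead): point the site root at
# Code/public instead of Code/Laravel/public in the given config file.
# Assumes the original path appears exactly as in the default configuration.
retarget_site() {
    config_file="$1"
    tmp_file="$(mktemp)" || return 1
    # Write to a temporary file first, because in-place editing flags
    # differ between sed implementations.
    sed 's|/home/vagrant/Code/Laravel/public|/home/vagrant/Code/public|' \
        "$config_file" > "$tmp_file" && cat "$tmp_file" > "$config_file"
    status=$?
    rm -f "$tmp_file"
    return "$status"
}
```

Call it with the path to your copy of Homestead.yml as the only argument. Copying the result back with cat, rather than moving the temporary file over the original, keeps the original file's permissions.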
You’ll also need to add the following to your hosts file:
192.168.10.10 homestead.app
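Adding the hosts entry can also be scripted so that running it twice does not create a duplicate line. The sketch below is one way to do that; add_host_entry is a hypothetical helper name, and the check is deliberately simple (it looks for the hostname among the names on any non-comment line).

```shell
#!/bin/sh
# Hypothetical helper: append "IP HOSTNAME" to a hosts file unless the
# hostname is already listed on a non-comment line.
add_host_entry() {
    hosts_file="$1"
    ip="$2"
    host="$3"
    # awk exits with status 0 only if some field after the IP equals the hostname.
    if awk -v h="$host" '
        $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
        END { exit !found }
    ' "$hosts_file"; then
        echo "$host is already present in $hosts_file"
    else
        printf '%s %s\n' "$ip" "$host" >> "$hosts_file"
        echo "added $ip $host to $hosts_file"
    fi
}
```

For the entry above, the arguments would be your hosts file, 192.168.10.10 and homestead.app. On Linux and Mac OS X the hosts file is /etc/hosts, which normally requires root privileges to modify, so the function would need to run in a root shell.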
Installing the Tesseract Binary
The next step is to install the Tesseract binary.
Because Homestead Improved uses a Debian-based distribution of Linux, we can use apt-get to install it after logging into the VM with vagrant ssh. It’s as simple as running the following command:
sudo apt-get install tesseract-ocr
As I mentioned above, there are instructions for other operating systems in the README.
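Once the installation has finished, it is worth confirming that the binary is actually available before going any further. Here is a small sketch of such a check; check_installed is a hypothetical helper name, and the only Tesseract option it relies on is --version.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is available on the PATH.
check_installed() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found on PATH"
        return 1
    fi
}

# Print the Tesseract version only if the binary is present.
if check_installed tesseract; then
    tesseract --version
fi
```

If the check passes, you can try the binary directly from inside the VM by giving it an image and using stdout as the output name, for example tesseract image.png stdout, where image.png stands in for any image of your own that contains text.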
by Lukas White via SitePoint