Friday, February 17, 2017

Reading text from Images using Java

This post shows how to read text from images using Java. It uses Tess4J, a Java wrapper for the Tesseract OCR library.
You can also use the module below to check whether the captcha on your site is strong enough to resist simple OCR-based attacks.

Reference:
https://github.com/tesseract-ocr/tessdata
http://stackoverflow.com/questions/18095708/tess4j-doesnt-use-its-tessdata-folder

Language Used:
Java

Git Location:
https://github.com/csanuragjain/extra/tree/master/ReadFromImages

POM Dependency:
 <!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->  
 <dependency>  
   <groupId>net.sourceforge.tess4j</groupId>  
   <artifactId>tess4j</artifactId>  
   <version>3.2.1</version>  
 </dependency>  

Pre-requisite:
1) Assume you are running this program from c:\myprogram. You can follow either of the two methods below, depending on your requirements.

Space saving method: (You download only the language data you need. The English dataset requires only about 30MB)
2) Create a folder named tessdata inside c:\myprogram\
3) Navigate to https://github.com/tesseract-ocr/tessdata
4) Download eng.traineddata to read English text, such as English captchas (trained data is available for other languages as well)
5) Place eng.traineddata inside the tessdata folder.
6) Finally your folder structure should look like c:\myprogram\tessdata\eng.traineddata

Time saving method: (Downloads trained data for several languages and consumes at least 1GB of space)
7) You can also skip Steps 2 to 5 and simply download the tessdata-master folder from https://github.com/tesseract-ocr/tessdata
8) Unzip the content of the tessdata-master.zip file into your main project folder (e.g. here it is c:\myprogram\)
9) Rename tessdata-master to tessdata
10) Finally your folder structure should look like c:\myprogram\tessdata\<Trained data for several languages>
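
Before running the program, it may help to confirm that the folder layout from the steps above is actually in place. The following is a minimal sketch using only the Java standard library; TessdataCheck, trainedDataPath and hasTrainedData are hypothetical names chosen for illustration and are not part of Tess4J or of the original project.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tess4J or the original project.
final class TessdataCheck {

    // Builds <projectDir>/tessdata/<lang>.traineddata, the layout described in the steps above.
    static Path trainedDataPath(Path projectDir, String lang) {
        return projectDir.resolve("tessdata").resolve(lang + ".traineddata");
    }

    // True only if the trained data file exists and is a regular file.
    static boolean hasTrainedData(Path projectDir, String lang) {
        return Files.isRegularFile(trainedDataPath(projectDir, lang));
    }

    public static void main(String[] args) {
        // Assumption: the program is started from the main project folder.
        Path projectDir = Paths.get("").toAbsolutePath();
        Path expected = trainedDataPath(projectDir, "eng");
        if (hasTrainedData(projectDir, "eng")) {
            System.out.println("Found " + expected);
        } else {
            System.err.println("Missing " + expected);
        }
    }
}
```

If the file is present but Tess4J still cannot find it, the Stack Overflow thread listed under Reference discusses that situation.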

Program:

ImageCracker class, crackImage method:
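 public static String crackImage(String filePath) {
     // Point a File object at the image to be read
     File imageFile = new File(filePath);
     // Create the Tess4J OCR engine
     ITesseract instance = new Tesseract();
     try {
         // Run OCR on the image and return the recognized text
         String result = instance.doOCR(imageFile);
         return result;
     } catch (TesseractException e) {
         // On failure, log the reason and return an error string
         System.err.println(e.getMessage());
         return "Error while reading image";
     }
 }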

How it works:
1) crackImage takes the path of the image that needs to be read
2) We point a File object to that image
3) We create a Tesseract object named instance
4) We call the doOCR method of the Tess4J library, passing the File object from step 2
5) The doOCR method returns the text read from the image, and crackImage returns it to the caller.
6) In case of failure, it prints the error message and returns an error string.
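
Because crackImage only reports a problem after doOCR has failed, a quick pre-check on the file can catch a wrong path or a non-image file earlier. Here is a minimal sketch using only the Java standard library; ImagePrecheck and isReadableImage are hypothetical names, and the check reflects what Java's ImageIO can decode, which may not match exactly what Tess4J accepts.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of Tess4J or the original project.
final class ImagePrecheck {

    // Returns true if the file exists, is readable, and ImageIO can decode it as an image.
    static boolean isReadableImage(File file) {
        if (!file.isFile() || !file.canRead()) {
            return false;
        }
        try {
            // ImageIO.read returns null when no registered reader understands the file
            BufferedImage image = ImageIO.read(file);
            return image != null;
        } catch (IOException e) {
            return false;
        }
    }
}
```

A caller could run ImagePrecheck.isReadableImage(new File(filePath)) before calling crackImage and skip the OCR step when it returns false.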

Driver class, main method:
 public static void main(String[] args) {
     // Read the text from testImage.PNG and print it to the console
     System.out.println(ImageCracker.crackImage("testImage.PNG"));
 }

How it works:
1) We call the crackImage method, passing the path of the image to be read.
2) We print the text returned by the method to the console.

Input Image (testImage.PNG):
Output:
Create a Youtube metadata crawler using Java
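
For the captcha check mentioned at the start of the post, the OCR result has to be compared against the text the image is known to contain. OCR output can carry extra line breaks or spaces, so the sketch below normalizes whitespace before comparing. OcrText, normalize and matches are hypothetical names for illustration, and the case-insensitive comparison is an assumption that may not suit case-sensitive captchas.

```java
// Hypothetical helper, not part of Tess4J or the original project.
final class OcrText {

    // Collapses runs of whitespace (including line breaks) into single spaces and trims the ends.
    static String normalize(String raw) {
        return raw == null ? "" : raw.trim().replaceAll("\\s+", " ");
    }

    // Compares normalized OCR output with the expected text, ignoring case (an assumption).
    static boolean matches(String ocrResult, String expected) {
        return normalize(ocrResult).equalsIgnoreCase(normalize(expected));
    }
}
```

If OcrText.matches(ImageCracker.crackImage("captcha.png"), knownAnswer) returns true for many of your captcha images, the captcha is probably too easy to read; captcha.png and knownAnswer here are placeholders for your own test data.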

Full Program:

ImageCracker class
 package com.cooltrickshome;  
 import java.io.File;  
 import net.sourceforge.tess4j.*;  
 public class ImageCracker {  
   public static String crackImage(String filePath) {  
     File imageFile = new File(filePath);  
     ITesseract instance = new Tesseract();   
     try {  
       String result = instance.doOCR(imageFile);  
       return result;  
     } catch (TesseractException e) {  
       System.err.println(e.getMessage());  
       return "Error while reading image";  
     }  
   }  
 }  

Driver class:
 package com.cooltrickshome;  
 public class Driver {  
      /**  
       * @param args  
       */  
      public static void main(String[] args) {  
           // Read the text from testImage.PNG and print it to the console
           System.out.println(ImageCracker.crackImage("testImage.PNG"));  
      }  
 }  

Hope it helps :)