Tesseract on Knot
Tesseract is optical character recognition (OCR) software that attempts to extract text from images (TIFF, PNG, etc.) or image PDFs (PDFs without searchable text). Currently Tesseract is installed only on Knot, and only the English language pack is installed.
Version:
tesseract-3.04.00-3.el6.x86_64
tesseract-osd-3.04.00-3.el6.x86_64
Usage:
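tesseract imagefile.tiff outputtextfile
(which writes the recognized text to outputtextfile.txt)
or
tesseract imagefile.tiff outputfile pdf
(which should create a searchable PDF, outputfile.pdf, of the original .tiff file; the trailing pdf is the name of a Tesseract config, not a filename)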
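If both a text file and a searchable PDF are wanted for the same image, the two commands can be wrapped in a small helper. This is only a sketch: ocr_one is a hypothetical function name, and OCR_CMD is a hypothetical variable that exists solely so the command can be swapped out for testing (it defaults to tesseract). It assumes the input filename has an extension and that the pdf config shipped with Tesseract 3.03 and later is available.

```shell
#!/bin/bash
# Sketch: OCR one image to plain text and to a searchable PDF.
# ocr_one and OCR_CMD are illustrative names, not part of Tesseract.
ocr_one() {
    local image="$1"
    # Strip the final extension to get the output base: scan.tiff -> scan
    local base="${image%.*}"
    # First pass writes <base>.txt
    "${OCR_CMD:-tesseract}" "$image" "$base" || return 1
    # Second pass writes <base>.pdf using the pdf config
    "${OCR_CMD:-tesseract}" "$image" "$base" pdf
}
```

For example, ocr_one scan.tiff would be expected to leave scan.txt and scan.pdf next to the original image.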
Tesseract is single-threaded, meaning that it uses only 1 core at a time, so when submitting jobs you may want to run several instances of tesseract at once. For instance, if we have 120 files to process, we can request 1 entire 12-core node, launch 12 instances of tesseract, wait for all of them to finish, and then launch the next 12 automatically in our job submission script. Note that each batch of 12 takes as long as its slowest file. Here is an example for files that contain incremental indexes in their filenames:
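#!/bin/bash
#PBS -l nodes=1:ppn=12
#PBS -l walltime=2:00:00
#PBS -V
cd "$PBS_O_WORKDIR"
# 10 batches of 12 files each (120 files total)
for (( j=1; j<=10; j++ ))
do
    # start one background tesseract per core
    for (( i=1; i<=12; i++ ))
    do
        tesseract "file.$j.$i" "output.$j.$i" &
    done
    # block until all 12 in this batch have finished
    wait
done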
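Because each batch waits for its slowest file, some cores sit idle near the end of every batch. One possible alternative, sketched here and not tested on Knot, is to let GNU xargs keep a fixed number of tesseract processes running and start a new one as soon as any finishes. The function name ocr_pool, the ".out" output suffix, and the OCR_CMD variable (a hook for swapping in another command while testing, defaulting to tesseract) are all illustrative assumptions. The sketch relies on the -0 and -P options of GNU xargs.

```shell
#!/bin/bash
# Sketch: run at most N OCR processes at a time, refilling a slot as soon
# as one finishes, instead of waiting for a whole batch.
# ocr_pool and OCR_CMD are illustrative names, not part of Tesseract or PBS.
ocr_pool() {
    local njobs="$1"
    shift
    # Nothing to do if no input files were given
    [ "$#" -gt 0 ] || return 0
    # Feed the filenames NUL-separated so spaces in names are safe.
    # xargs runs: sh -c '...' <command> <file>, so $0 is the OCR command
    # and $1 is one input file; the output base is "<file>.out".
    printf '%s\0' "$@" |
        xargs -0 -n 1 -P "$njobs" sh -c '"$0" "$1" "$1.out"' "${OCR_CMD:-tesseract}"
}
```

In a job script that requested ppn=12, this could be called as ocr_pool 12 file.*.* after changing to $PBS_O_WORKDIR. With this naming, the text for file.3.7 would end up in file.3.7.out.txt.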