Batch-converting JPEG files for OCR

8th March 2008

I was sent a whole bunch of .jpg files of scanned documents with text that I wanted to extract.

I have Microsoft Office Document Imaging (MODI) installed, so I was keen to use that to perform the OCR (instead of re-typing all the text!). The only problem is that MODI only understands TIFF and MDI formats.

I used ImageMagick to do the conversion. Its convert tool might sound like the best candidate, but mogrify, which processes a whole batch of files with one command, did the job for me.

You can convert a whole lot of files using the following command:

mogrify -format tiff *.jpg

This creates a new TIFF file for each JPEG file and leaves the originals untouched. The only problem is that MODI doesn’t like the particular flavour of TIFF generated. Fortunately ImageMagick has 1001 options to configure exactly what you want to happen.

After a bit of experimentation, I found that the following extra options generate TIFF files that MODI can read without problems:

mogrify -format tiff -colorspace RGB -compress RLE *.jpg
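
If you would rather convert the files one at a time (say, to skip ones that have already been done), the same options can be passed to convert inside a loop. This is only a rough sketch: it assumes a POSIX-style shell (Cygwin or similar on Windows), that ImageMagick's convert is the one found on the PATH (Windows ships an unrelated convert.exe), and the tiff_name helper is a made-up name for illustration.

```shell
#!/bin/sh
# Hypothetical helper: map a JPEG file name to its TIFF name,
# e.g. scan01.jpg -> scan01.tiff
tiff_name() {
    printf '%s\n' "${1%.*}.tiff"
}

# Only attempt the conversion if a convert command is available.
# Assumption: it is ImageMagick's convert, not the Windows disk utility.
if command -v convert >/dev/null 2>&1; then
    for f in *.jpg; do
        [ -e "$f" ] || continue          # no .jpg files matched the glob
        out=$(tiff_name "$f")
        [ -e "$out" ] && continue        # already converted, skip it
        # Same options as the mogrify command above
        convert "$f" -colorspace RGB -compress RLE "$out"
    done
fi
```

Newer ImageMagick releases prefer the magick command over convert, so the name may need adjusting depending on the installed version.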

All good, except that I then discovered that the scanning was at such a low DPI that the OCR wasn’t able to find any text :-(
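
In hindsight it would have been worth checking the resolution before converting anything. A minimal sketch of how that might look with ImageMagick's identify is below. The assumptions: a POSIX-style shell, that the reported resolution is in pixels per inch, and a threshold of 300 DPI, which is a commonly quoted rule of thumb for OCR rather than anything MODI documents. MIN_DPI and is_low_dpi are names invented for this example.

```shell
#!/bin/sh
# Assumed threshold: 300 DPI is a common rule of thumb for OCR,
# not a documented MODI requirement.
MIN_DPI=300

# Hypothetical helper: succeeds if the given resolution is below MIN_DPI.
# Accepts values such as "72", "72.009" or "72 PixelsPerInch" by keeping
# only the leading digits.
is_low_dpi() {
    dpi_int=${1%%[!0-9]*}
    [ "${dpi_int:-0}" -lt "$MIN_DPI" ]
}

if command -v identify >/dev/null 2>&1; then
    for f in *.jpg; do
        [ -e "$f" ] || continue              # no .jpg files matched the glob
        dpi=$(identify -format '%x' "$f")    # %x = horizontal resolution
        if is_low_dpi "$dpi"; then
            echo "$f: reported resolution $dpi looks too low for OCR"
        fi
    done
fi
```

Bear in mind that identify only reports the resolution recorded in the file's metadata, which scanning software does not always set correctly, so treat the result as a hint rather than proof.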

Something else that sounds interesting is that MODI can be programmed against. Maybe I could automate this even more!